COVARIANCE ASSISTED SCREENING AND
ESTIMATION 

By Tracy Ke, Jiashun Jin and Jianqing Fan
Princeton University and Carnegie Mellon University 

Consider a linear model $Y = X\beta + z$, where $X = X_{n,p}$ and
$z \sim N(0, I_n)$. The vector $\beta$ is unknown and it is of interest to separate
its nonzero coordinates from the zero ones (i.e., variable selection). 
Motivated by examples in long-memory time series [11] and change- 
point problem [2], we are primarily interested in the case where the 
Gram matrix G = X'X is non-sparse but sparsifiable by a finite order 
linear filter. We focus on the regime where signals are both rare and 
weak so that successful variable selection is very challenging but is 
still possible. 

We approach this problem by a new procedure called the Covari- 
ance Assisted Screening and Estimation (CASE). CASE first uses 
a linear filtering to reduce the original setting to a new regression 
model where the corresponding Gram (covariance) matrix is sparse. 
The new covariance matrix induces a sparse graph, which guides us to 
conduct multivariate screening without visiting all the submodels. By 
interacting with the signal sparsity, the graph enables us to decom- 
pose the original problem into many separated small-size subprob- 
lems (if only we know where they are!). Linear filtering also induces 
a so-called problem of information leakage, which can be overcome 
by the newly introduced patching technique. Together, these give rise 
to CASE, which is a two-stage Screen and Clean [10, 32] procedure, 
where we first identify candidates of these submodels by patching 
and screening, and then re-examine each candidate to remove false 
positives. 

For any procedure $\hat\beta$ for variable selection, we measure the performance
by the minimax Hamming distance between the sign vectors
of $\hat\beta$ and $\beta$. We show that in a broad class of situations where the
Gram matrix is non-sparse but sparsifiable, CASE achieves the optimal
rate of convergence. The results are successfully applied to a
long-memory time series model and a change-point model.

1. Introduction. Consider a linear regression model

(1.1) $Y = X\beta + z, \quad X = X_{n,p}, \quad z \sim N(0, I_n).$

The vector $\beta$ is unknown but is sparse, in the sense that only a small fraction
of its coordinates is nonzero. The goal is to separate the nonzero coordinates
of $\beta$ from the zero ones (i.e., variable selection).

We are primarily interested in the case where the Gram matrix G = 
X'X is non-sparse but sparsifiable. We call G sparse if each of its rows 
has relatively few 'large' elements, and we call G sparsifiable if G can be 
reduced to a sparse matrix by some simple operations (e.g. linear filtering 
or low-rank matrix removal). The Gram matrix plays a critical role in sparse
inference, as the sufficient statistics $X'Y \sim N(G\beta, G)$. Examples where G is
non-sparse but sparsifiable can be found in the following application areas. 

• Change-point problem. Driven by research on DNA copy
number variation, this problem has recently received renewed interest
[18, 24, 25, 30]. While existing literature focuses on detecting change- 
points, locating change-points is also of major interest in many appli- 
cations [1, 28, 34]. Consider a change-point model 

(1.2) $Y_i = \theta_i + z_i, \quad z_i \sim N(0,1), \quad 1 \le i \le p,$

where $\theta = (\theta_1, \ldots, \theta_p)'$ is a piecewise constant vector with jumps
at relatively few locations. Let $X = X_{p,p}$ be the matrix such that
$X(i,j) = 1\{j \ge i\}$, $1 \le i, j \le p$. We re-parametrize the parameters by

$\theta = X\beta$, where $\beta_k = \theta_k - \theta_{k+1}$, $1 \le k \le p-1$, and $\beta_p = \theta_p$,

so that $\beta_k$ is nonzero if and only if $\theta$ has a jump at location k. The
Gram matrix G has elements G(i,j) = min{i,j}, which is evidently 
non-sparse. However, adjacent rows of G display a high level of sim- 
ilarity, and the matrix can be sparsified by a second order adjacent 
differencing between the rows. 

• Long-memory time series. We consider using time-dependent data to 
build a prediction model for variables of interest: 

$Y_t = \sum_j \beta_j X_{t-j} + \epsilon_t,$

where $\{X_t\}$ is an observed stationary time series and $\{\epsilon_t\}$ is white
noise. In many applications, $\{X_t\}$ is a long-memory process. Examples
include volatility processes [11, 27], exchange rates, electricity demand,
and river outflows (e.g., the Nile). Note that the problem can be reformulated
as (1.1), where the Gram matrix G = X'X is asymptotically
close to the auto-covariance matrix of $\{X_t\}$ (say, $\Omega$). It is well known
that $\Omega$ is Toeplitz, that its off-diagonal decay is very slow, and
that its matrix $L^1$-norm diverges as $p \to \infty$. However, the Gram
matrix can be sparsified by a first order adjacent differencing between
the rows.
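
As a small numerical illustration of the change-point example above, the following sketch builds the design $X(i,j) = 1\{j \ge i\}$, checks that $G(i,j) = \min\{i,j\}$, and applies a second order adjacent differencing between the rows. The filter coefficients $(1, -2, 1)$, the helper name `linear_filter`, and the size p = 50 are assumptions chosen for illustration; they are not taken from the text.

```python
import numpy as np

def linear_filter(p, eta):
    """p x p filter D(i,j) = 1{i=j} + eta[0]*1{i=j-1} + ... + eta[h-1]*1{i=j-h}."""
    D = np.eye(p)
    for lag, coef in enumerate(eta, start=1):
        D += coef * np.eye(p, k=lag)   # ones on the lag-th superdiagonal
    return D

p = 50
# Change-point design X(i,j) = 1{j >= i}: upper triangular matrix of ones.
X = np.triu(np.ones((p, p)))
G = X.T @ X

# With 1-based indices, the Gram matrix should equal min{i, j}.
idx = np.arange(1, p + 1)
G_expected = np.minimum.outer(idx, idx)

# Assumed second order differencing filter, eta = (1, -2, 1)'.
D = linear_filter(p, eta=[-2.0, 1.0])
B = D @ G

# Count the non-negligible entries in each row, excluding the last two rows,
# where the filter is truncated at the boundary.
nonzeros_per_row = (np.abs(B[: p - 2]) > 1e-9).sum(axis=1)
```

In this sketch, every row of DG other than the last two has a single nonzero entry (on the first superdiagonal), which is the sense in which the non-sparse G is sparsified.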

Further examples include jump detections in (logarithm) asset prices and 
time series following a FARIMA model [11]. Still other examples include 
the factor models, where G can be decomposed as the sum of a sparse 
matrix and a low rank (positive semi-definite) matrix. In these examples, G 
is non-sparse, but it can be sparsified either by adjacent row differencing or 
low-rank matrix removal. 

In this paper, motivated by the above examples, we are primarily inter- 
ested in the case where G is non-sparse but can be sparsified by a finite-order 
linear filtering. However, the idea developed in the paper applies to much 
broader settings, where G can be sparsified by some other methods rather 
than linear filtering. 

When G is non-sparse, many existing variable selection methods face challenges.
Take the lasso [5, 7, 29] for example. The success of the lasso hinges
on the so-called irrepresentable condition [35], which usually does not hold
in the current setting as the columns of X are strongly dependent. Similar
conclusions can be drawn for other popular approaches, such as the SCAD
[9] (although the conditions for its success are far less stringent than those
of the lasso) and the Dantzig selector [4].

In this paper, we propose a new variable selection method which we call 
Covariance Assisted Screening and Estimation (CASE). The main method- 
ological innovation of CASE is to exploit the rich information hidden in the 
'local' graphical structures among the design variables, which the lasso and 
many other procedures do not utilize. 

At the core of CASE is covariance assisted multivariate screening. Screening
is a well-known method of dimension reduction in Big Data. However,
most literature to date has focused on univariate screening or marginal
screening [10, 15]. The major concern for extending marginal screening to
(brute-force) m-variate screening, m > 1, is the computational cost. The
computational complexity is at least $O(p^m)$ (excluding the complexity for
obtaining X'Y from (X, Y); same below), which is usually unaffordable in
high-dimensional problems. CASE screens only models that have $\le m$ nodes
and that form a connected subgraph of GOSD (a graph to be introduced
below). As a result, in a broad context, CASE only has a computational
cost of p, up to a multi-$\log(p)$ factor, and so overcomes the computational
challenge.

We show that CASE achieves asymptotic minimaxity of Hamming distance
in the very challenging regime where the signals are both rare and
weak; that is, only a small fraction of the coordinates of $\beta$ is nonzero, and
each nonzero coordinate is relatively small. See for example [31] and the references
therein. Many recent works on variable selection focus on the regime
where the signals are rare but strong, and usually the probability of exact
support recovery or the oracle property is used to assess the optimality of a
procedure $\hat\beta$. When signals are both rare and weak, exact support recovery
is usually impossible, and the Hamming distance, which measures the number
of coordinates at which the sign vectors of $\hat\beta$ and $\beta$ disagree, is a more
natural criterion for assessing optimality. Compared to many recent works,
the theoretical framework developed in this paper is not only technically more
challenging, but also scientifically more relevant.

Below, first, in Section 1.1, we introduce the Rare and Weak signal model. 
We then formally introduce the notion of sparsifiability in Section 1.2. The 
starting point of CASE is the use of a linear filter. In Section 1.3, we explain
how linear filtering helps in variable selection by simultaneously maintaining
signal sparsity and making the covariance matrix nearly block diagonal. In
Section 1.4, we explain that linear filtering also causes a so-called problem of
information leakage, and how to overcome this problem by the technique
of patching. After all these ideas are discussed, we formally introduce
CASE in Section 1.5. The computational complexity, theoretical properties,
and applications of CASE are investigated in Sections 1.6-1.11.

1.1. Rare and Weak signal model. Our primary interest is in the situa- 
tions where the signals are rare and weak, and where we have no information 
on the underlying structure of the signals. In such situations, it makes sense 
to use the following Rare and Weak signal model; see [3, 8, 20]. Fix $\epsilon \in (0, 1)$
and $\tau > 0$. Let $b = (b_1, \ldots, b_p)'$ be the $p \times 1$ vector satisfying

(1.3) $b_i \overset{iid}{\sim} \mathrm{Bernoulli}(\epsilon),$

and let $\Theta_p(\tau)$ be the set of vectors

(1.4) $\Theta_p(\tau) = \{\mu \in \mathbb{R}^p : |\mu_i| \ge \tau, \ 1 \le i \le p\}.$

We model $\beta$ by

(1.5) $\beta = b \circ \mu,$

where $\mu \in \Theta_p(\tau)$ and $\circ$ is the Hadamard product (also called the coordinate-wise
product). In Section 1.7, we further restrict $\mu$ to a subset of $\Theta_p(\tau)$.

In this model, $\beta_i$ is either 0 or a signal with a strength $\ge \tau$. Since we
have no information on where the signals are, we assume that they appear
at locations that are randomly generated. We are primarily interested in the
challenging case where $\epsilon$ is small and $\tau$ is relatively small, so the signals are
both rare and weak.

Definition 1.1. We call Model (1.3)-(1.5) the Rare and Weak signal
model $RW(\epsilon, \tau, \mu)$.

We remark that the theory developed in this paper is not tied to the Rare 
and Weak signal model, and applies to more general cases. For example, the 
main results can be extended to the case where we have some additional 
information about the underlying structure of the signals (e.g. Ising model
[17]).

1.2. Sparsifiability, linear filtering, and GOSD. As mentioned before, we 
are primarily interested in the case where the Gram matrix G can be spar- 
sified by a finite-order linear filtering. 

Fix an integer $h \ge 1$ and an $(h+1)$-dimensional vector $\eta = (1, \eta_1, \ldots, \eta_h)'$.
Let $D = D_{h,\eta}$ be the $p \times p$ matrix satisfying

(1.6) $D_{h,\eta}(i,j) = 1\{i = j\} + \eta_1 1\{i = j-1\} + \cdots + \eta_h 1\{i = j-h\}, \quad 1 \le i,j \le p.$

The matrix $D_{h,\eta}$ can be viewed as a linear operator that maps any $p \times 1$
vector y to $D_{h,\eta} y$. For this reason, $D_{h,\eta}$ is also called an order h linear filter
[11].

For $\alpha > 0$ and $A_0 > 0$, we introduce the following class of matrices:

(1.7) $\mathcal{M}_p(\alpha, A_0) = \{\Omega \in \mathbb{R}^{p \times p} : \Omega(i,i) \le 1, \ |\Omega(i,j)| \le A_0 (1 + |i-j|)^{-\alpha}, \ 1 \le i,j \le p\}.$

Matrices in $\mathcal{M}_p(\alpha, A_0)$ are not necessarily symmetric.

Definition 1.2. Fix an order h linear filter $D = D_{h,\eta}$. We say that G
is sparsifiable by $D_{h,\eta}$ if for sufficiently large p, $DG \in \mathcal{M}_p(\alpha, A_0)$ for some
constants $\alpha > 1$ and $A_0 > 0$.

In the long-memory time series model, G can be sparsified by an order 1
linear filter. In the change-point model, G can be sparsified by an order 2
linear filter.

The main benefit of linear filtering is that it induces sparsity in the Graph
of Strong Dependence (GOSD) to be introduced below. Recall that the sufficient
statistics $\tilde{Y} = X'Y \sim N(G\beta, G)$. Applying a linear filter $D = D_{h,\eta}$
to $\tilde{Y}$ gives

(1.8) $d \sim N(B\beta, H),$

where $d = D(X'Y)$, $B = DG$, and $H = DGD'$. Note that no information is
lost when we reduce Model (1.1) to Model (1.8).

At the same time, if G is sparsifiable by $D = D_{h,\eta}$, then both the matrices
B and H are sparse, in the sense that each row of either matrix has relatively
few large coordinates. In other words, for a properly small threshold $\delta > 0$
to be determined, let $B^*$ and $H^*$ be the regularized matrices of B and H,
respectively:

$B^*(i,j) = B(i,j) 1\{|B(i,j)| \ge \delta\}, \quad H^*(i,j) = H(i,j) 1\{|H(i,j)| \ge \delta\}, \quad 1 \le i,j \le p.$

It is seen that

(1.9) $d \approx N(B^*\beta, H^*),$

where each row of $B^*$ or $H^*$ has relatively few nonzeros. Compared to (1.8),
(1.9) is much easier to track analytically, but it contains almost all the
information about $\beta$.

The above observation naturally motivates the following graph, which we 
call the Graph of Strong Dependence (GOSD). 

Definition 1.3. For a given parameter $\delta$, the GOSD is the graph $\mathcal{G}^* = (V,E)$
with nodes $V = \{1, 2, \ldots, p\}$ and there is an edge between i and j
when any of the three numbers $H^*(i,j)$, $B^*(i,j)$, and $B^*(j,i)$ is nonzero.

Definition 1.4. A graph $\mathcal{G} = (V,E)$ is called K-sparse if the degree of
each node is $\le K$.

The definition of GOSD depends on a tuning parameter $\delta$, the choice
of which is not critical, and it is generally sufficient if we choose $\delta = \delta_p = 1/\log(p)$;
see Section 5.1 for details. With such a choice of $\delta$, it can be shown
that in a general context, GOSD is K-sparse, where $K = K_p$ does not exceed
a multi-$\log(p)$ term as $p \to \infty$ (see Lemma 5.1).
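
The construction of the GOSD from G, a linear filter D, and a threshold can be sketched as follows. This is only an illustration under stated assumptions: the function name `gosd_adjacency`, the Toeplitz matrix with entries $(1+|i-j|)^{-0.5}$ (a hypothetical stand-in for a slowly decaying Gram matrix), and p = 200 are not from the text; the threshold follows the choice $1/\log(p)$ mentioned above.

```python
import numpy as np

def gosd_adjacency(G, D, delta):
    """Boolean adjacency matrix of the GOSD for Gram matrix G, filter D, threshold delta."""
    B = D @ G
    H = D @ G @ D.T
    B_star = np.where(np.abs(B) >= delta, B, 0.0)   # regularized B
    H_star = np.where(np.abs(H) >= delta, H, 0.0)   # regularized H
    # Edge between i and j when any of H*(i,j), B*(i,j), B*(j,i) is nonzero.
    adj = (H_star != 0) | (B_star != 0) | (B_star.T != 0)
    np.fill_diagonal(adj, False)                    # self-loops are not edges
    return adj

p = 200
lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
G = (1.0 + lags) ** -0.5            # hypothetical slowly decaying Toeplitz Gram matrix
D = np.eye(p) - np.eye(p, k=1)      # order 1 filter with eta = (1, -1)'
delta = 1.0 / np.log(p)

adj = gosd_adjacency(G, D, delta)
degree = adj.sum(axis=1)
# Number of entries of the unfiltered G that exceed delta in each row, for comparison.
large_per_row_G = (np.abs(G) >= delta).sum(axis=1)
```

In this toy setting, nodes away from the lower boundary (where the last row of D is truncated) have at most two neighbors in the GOSD, while every row of the unfiltered G has many entries above the threshold.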

1.3. Interplay between the graph sparsity and signal sparsity. That said,
it remains unclear how the sparsity of $\mathcal{G}^*$ helps in variable selection.
In fact, even when $\mathcal{G}^*$ is 2-sparse, it is possible that a node k is
connected, through possibly long paths, to many other nodes; it is unclear
how to remove the effect of these nodes when we try to estimate $\beta_k$.

Somewhat surprisingly, the answer lies in an interesting interplay between
the signal sparsity and graph sparsity. To see this point, let $S = S(\beta)$ be
the support of $\beta$, and let $\mathcal{G}^*_S$ be the subgraph of $\mathcal{G}^*$ formed by the nodes in
S only. Given the sparsity of $\mathcal{G}^*$, if the signal vector $\beta$ is also sparse, then it
is likely that the sizes of all components of $\mathcal{G}^*_S$ (a component of a graph is
a maximal connected subgraph) are uniformly small. This is justified in the
following lemma, which is proved in [20].

Lemma 1.1. Suppose $\mathcal{G}^*$ is K-sparse and the support $S = S(\beta)$ is a
realization from $\beta_j \overset{iid}{\sim} (1-\epsilon)\nu_0 + \epsilon\pi$, where $\nu_0$ is the point mass at 0 and $\pi$ is
any distribution with support $\subseteq \mathbb{R} \setminus \{0\}$. With a probability (from randomness
of S) at least $1 - p(e\epsilon K)^{m+1}$, $\mathcal{G}^*_S$ decomposes into many components with
size no larger than m.

In this paper, we are primarily interested in cases where for large p, $\epsilon \le p^{-\vartheta}$
for some parameter $\vartheta \in (0, 1)$ and K is bounded by a multi-$\log(p)$
term. In such cases, the decomposability of $\mathcal{G}^*_S$ holds for a finite m, with
overwhelming probability.
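
The decomposition described in Lemma 1.1 can be explored numerically. The sketch below draws a random support on a sparse graph and lists the component sizes of the induced subgraph. The banded graph, the helper names `banded_graph` and `components`, and the values of p and $\vartheta$ are hypothetical choices made for illustration only.

```python
import numpy as np

def banded_graph(p, half_width):
    """Neighbor lists of a graph where i ~ j iff 0 < |i - j| <= half_width."""
    return {i: [j for j in range(max(0, i - half_width), min(p, i + half_width + 1))
                if j != i]
            for i in range(p)}

def components(neighbors, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in neighbors[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps

rng = np.random.default_rng(1)
p, vartheta = 5000, 0.6                      # hypothetical values
eps = p ** (-vartheta)
graph = banded_graph(p, half_width=2)        # every node has degree <= 4
support = np.flatnonzero(rng.random(p) < eps).tolist()
sizes = [len(c) for c in components(graph, support)]

# Deterministic toy check on a path graph with 10 nodes.
toy = components(banded_graph(10, 1), [0, 1, 2, 5, 8, 9])
```

Inspecting `max(sizes)` for different values of p and $\vartheta$ gives a feel for how small the signal islands typically are when both the graph and the signals are sparse.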

Lemma 1.1 delineates an interesting picture: The set of signals decomposes
into many small-size isolated signal islands (if only we know where), each
of which is a component of $\mathcal{G}^*_S$, and different ones are disconnected in the
GOSD. As a result, the original p-dimensional problem can be viewed as the
aggregation of many separated small-size subproblems that can be solved
in parallel. This is the key insight of this paper.

Note that the decomposability of $\mathcal{G}^*_S$ is attributable to the interplay between
the signal sparsity and the graph sparsity, where the latter is attributable to the
use of linear filtering. The decomposability is not tied to the specific model
of $\beta$ in Lemma 1.1, and holds for much broader situations (e.g. when b is
generated by a sparse Ising model [17]).

1.4. Information leakage and patching. While it largely facilitates the 
decomposability of the model, we must note that the linear filtering also 
induces a so-called problem of information leakage. In this section, we discuss 
how linear filtering causes such a problem and how to overcome it by the 
so-called technique of patching. 

The following notation is frequently used in this paper.

Definition 1.5. For $\mathcal{I} \subset \{1, 2, \ldots, p\}$, $\mathcal{J} \subset \{1, \ldots, N\}$, and a $p \times N$
matrix X, $X^{\mathcal{I}}$ denotes the $|\mathcal{I}| \times N$ sub-matrix formed by restricting the rows
of X to $\mathcal{I}$, and $X^{\mathcal{J},\mathcal{I}}$ denotes the $|\mathcal{J}| \times |\mathcal{I}|$ sub-matrix formed by restricting
the columns of X to $\mathcal{I}$ and rows to $\mathcal{J}$.

Note that when N = 1, X is a $p \times 1$ vector, and $X^{\mathcal{I}}$ is an $|\mathcal{I}| \times 1$ vector.

To appreciate information leakage, we first consider an idealized case
where each row of G has $\le K$ nonzeros. In this case, there is no need
for linear filtering, so B = H = G and $d = \tilde{Y}$. Recall that $\mathcal{G}^*_S$ consists of
many signal islands and let $\mathcal{I}$ be one of them. It is seen that

(1.10) $d^{\mathcal{I}} \approx N(G^{\mathcal{I},\mathcal{I}} \beta^{\mathcal{I}}, G^{\mathcal{I},\mathcal{I}}),$

and how well we can estimate $\beta^{\mathcal{I}}$ is captured by the Fisher Information
Matrix $G^{\mathcal{I},\mathcal{I}}$ [21].

Come back to the case where G is non-sparse. Interestingly, despite the
strong correlations, $G^{\mathcal{I},\mathcal{I}}$ continues to be the Fisher information for estimating
$\beta^{\mathcal{I}}$. However, when G is non-sparse, we must use a linear filter
$D = D_{h,\eta}$ as suggested, and we have

(1.11) $d^{\mathcal{I}} \approx N(B^{\mathcal{I},\mathcal{I}} \beta^{\mathcal{I}}, H^{\mathcal{I},\mathcal{I}}).$

Moreover, letting $\mathcal{J} = \{1 \le j \le p : D(i,j) \ne 0 \text{ for some } i \in \mathcal{I}\}$, it follows
that

$B^{\mathcal{I},\mathcal{I}} \beta^{\mathcal{I}} = D^{\mathcal{I},\mathcal{J}} G^{\mathcal{J},\mathcal{I}} \beta^{\mathcal{I}}.$

By the definition of D, $|\mathcal{J}| > |\mathcal{I}|$, and the dimension of the following null
space is $\ge 1$:

(1.12) $\mathrm{Null}(\mathcal{I}, \mathcal{J}) = \{\xi \in \mathbb{R}^{|\mathcal{J}|} : D^{\mathcal{I},\mathcal{J}} \xi = 0\}.$

Compare (1.11) with (1.10), and imagine the oracle situation where we are
told the mean vector of $d^{\mathcal{I}}$ in both. The difference is that we can fully
recover $\beta^{\mathcal{I}}$ using (1.10), but are not able to do so with only (1.11). In other
words, the information about $\beta^{\mathcal{I}}$ is partially lost in (1.11): if we estimate
$\beta^{\mathcal{I}}$ with (1.11) alone, we will never achieve the desired accuracy.

The argument is validated in Lemma 1.2 below, where the Fisher information
associated with (1.11) is strictly "smaller" than $G^{\mathcal{I},\mathcal{I}}$; the difference
between the two matrices can be derived by taking $\mathcal{I}^+ = \mathcal{I}$ and $\mathcal{J}^+ = \mathcal{J}$ in
(1.13). We call this phenomenon "information leakage".

To mitigate this, we expand the information content by including data
in the neighborhood of $\mathcal{I}$. This process is called "patching". Let $\mathcal{I}^+$ be
an extension of $\mathcal{I}$ by adding a few neighboring nodes, and define similarly
$\mathcal{J}^+ = \{1 \le j \le p : D(i,j) \ne 0 \text{ for some } i \in \mathcal{I}^+\}$ and $\mathrm{Null}(\mathcal{I}^+, \mathcal{J}^+)$.
Assuming that there is no edge between any node in $\mathcal{I}^+$ and any node in
$\mathcal{G}^*_S \setminus \mathcal{I}$,

(1.13) $d^{\mathcal{I}^+} \approx N(B^{\mathcal{I}^+,\mathcal{I}} \beta^{\mathcal{I}}, H^{\mathcal{I}^+,\mathcal{I}^+}).$

The Fisher Information Matrix for $\beta^{\mathcal{I}}$ under Model (1.13) is larger than that
of (1.11), which is captured in the following lemma.

Lemma 1.2. The Fisher Information Matrix associated with Model (1.13)
is

(1.14) $G^{\mathcal{I},\mathcal{I}} - \big[ U (U' (G^{\mathcal{J}^+,\mathcal{J}^+})^{-1} U)^{-1} U' \big]^{\mathcal{I},\mathcal{I}},$

where U is any $|\mathcal{J}^+| \times (|\mathcal{J}^+| - |\mathcal{I}^+|)$ matrix whose columns form an orthonormal
basis of $\mathrm{Null}(\mathcal{I}^+, \mathcal{J}^+)$.

When the size of $\mathcal{I}^+$ becomes appropriately large, the second matrix in
(1.14) is small element-wise (and so is negligible) under mild conditions (see
details in Lemma 2.3). This matrix is usually non-negligible if we set $\mathcal{I}^+ = \mathcal{I}$
and $\mathcal{J}^+ = \mathcal{J}$ (i.e., without patching).
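
The effect of patching on the Fisher information can be checked numerically. The sketch below computes the Fisher information of Model (1.13) in two ways, directly as $(B^{\mathcal{I}^+,\mathcal{I}})'(H^{\mathcal{I}^+,\mathcal{I}^+})^{-1}B^{\mathcal{I}^+,\mathcal{I}}$ and through the expression in (1.14), and compares the unpatched and patched cases. The function names, the Toeplitz matrix with entries $(1+|i-j|)^{-0.5}$, the size p = 300, and the index sets are assumptions for illustration; indices in the code are 0-based.

```python
import numpy as np

def fisher_direct(G, D, I, I_plus):
    """(B^{I+,I})' (H^{I+,I+})^{-1} B^{I+,I} with B = DG and H = DGD'."""
    B = (D @ G)[np.ix_(I_plus, I)]
    H = (D @ G @ D.T)[np.ix_(I_plus, I_plus)]
    return B.T @ np.linalg.solve(H, B)

def fisher_lemma(G, D, I, I_plus):
    """G^{I,I} - [U (U' (G^{J+,J+})^{-1} U)^{-1} U']^{I,I}, following (1.14)."""
    # J+ collects the columns where some row of D in I+ is nonzero.
    J_plus = np.flatnonzero(np.any(D[I_plus] != 0, axis=0))
    A = D[np.ix_(I_plus, J_plus)]
    # Orthonormal basis of the null space of A from the full SVD.
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > 1e-10))
    U = Vt[rank:].T
    GJ = G[np.ix_(J_plus, J_plus)]
    M = U @ np.linalg.solve(U.T @ np.linalg.solve(GJ, U), U.T)
    pos = np.searchsorted(J_plus, I)      # positions of I inside J+
    return G[np.ix_(I, I)] - M[np.ix_(pos, pos)]

p = 300
lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
G = (1.0 + lags) ** -0.5                  # hypothetical non-sparse Gram matrix
D = np.eye(p) - np.eye(p, k=1)            # order 1 filter with eta = (1, -1)'

I = [150]
I_patched = list(range(140, 161))         # hypothetical patch around node 150
unpatched = fisher_direct(G, D, I, I)[0, 0]
patched = fisher_direct(G, D, I, I_patched)[0, 0]
```

In this sketch the two computations agree, and the patched value lies between the unpatched value and $G^{\mathcal{I},\mathcal{I}} = 1$, in line with the statement that patching narrows the gap.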

Example 1. We illustrate the above phenomenon with an example where
p = 5000, G is the matrix satisfying $G(i,j) = [1 + 5|i-j|]^{-0.95}$ for all
$1 \le i,j \le p$, and $D = D_{h,\eta}$ with h = 1 and $\eta = (1,-1)'$. If $\mathcal{I} = \{2000\}$,
then $G^{\mathcal{I},\mathcal{I}} = 1$, but the Fisher information associated with Model (1.11)
is 0.5. The gap can be substantially narrowed if we patch with
$\mathcal{I}^+ = \{1990, 1991, \ldots, 2010\}$, in which case the Fisher information in (1.14) is
0.904.

1.5. Covariance Assisted Screening and Estimation (CASE). In sum- 
mary, we start from the post-filtering regression model 

$d = D\tilde{Y}$, where $\tilde{Y} = X'Y$ and $D = D_{h,\eta}$ is a linear filter.

We have observed the following. 

• Signal Decomposability. Linear filtering induces sparsity in GOSD, a
graph constructed from the Gram matrix G. In this graph, the set of
all true signals decomposes into many small-size signal islands, each
of which is a component of GOSD.

• Information Patching. Linear filtering also causes information leakage,
which can be overcome by a delicate patching technique.

Fig 1. Illustration of Graph of Strong Dependence (GOSD). Red: signal nodes. Blue: noise
nodes. (a) GOSD with 10 nodes. (b) Nodes of GOSD that survived the PS-step.

Naturally, these motivate a two-stage Screen and Clean variable selection ap- 
proach which we call Covariance Assisted Screening and Estimation (CASE). 
CASE contains a Patching and Screening (PS) step, and a Patching and Es- 
timation (PE) step. 

• PS-step. We use sequential $\chi^2$-tests to identify candidates for each
signal island. Each $\chi^2$-test is guided by $\mathcal{G}^*$, and aided by a carefully
designed patching step. This achieves multivariate screening without
visiting all submodels.

• PE-step. We re-investigate each candidate with penalized MLE and
a certain patching technique, in the hope of removing false positives.

For the purpose of patching, the PS-step and the PE-step use tuning
integers $\ell^{ps}$ and $\ell^{pe}$, respectively. The following notations are frequently
used in this paper.

Definition 1.6. For any index $1 \le i \le p$, $\{i\}^{ps} = \{1 \le j \le p : |j - i| \le \ell^{ps}\}$.
For any subset $\mathcal{I}$ of $\{1, 2, \ldots, p\}$, $\mathcal{I}^{ps} = \cup_{i \in \mathcal{I}} \{i\}^{ps}$. Similar notation
applies to $\{i\}^{pe}$ and $\mathcal{I}^{pe}$.

We now discuss the two steps in detail. Consider the PS-step first. Fix m > 1.
Suppose that $\mathcal{G}^*$ has a total of T connected subgraphs with size $\le m$, which
we denote by $\{\mathcal{G}_t\}_{t=1}^T$, arranged in ascending order of size, with ties
broken lexicographically.

Example 2(a). We illustrate this with a toy example, where p = 10
and the GOSD is displayed in Figure 1(a). For m = 3, GOSD has T = 30
connected subgraphs, which we arrange as follows. Note that $\{\mathcal{G}_t\}_{t=1}^{10}$ are
singletons, $\{\mathcal{G}_t\}_{t=11}^{20}$ are connected pairs, and $\{\mathcal{G}_t\}_{t=21}^{30}$ are connected triplets.

{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}

{1, 2}, {1, 7}, {2, 4}, {3, 4}, {4, 5}, {5, 6}, {7, 8}, {8, 9}, {8, 10}, {9, 10}

{1, 2, 4}, {1, 2, 7}, {1, 7, 8}, {2, 3, 4}, {2, 4, 5}, {3, 4, 5}, {4, 5, 6}, {7, 8, 9}, {7, 8, 10}, {8, 9, 10}

In this example, the multivariate screening examines sequentially only the
30 submodels above to decide whether any variables have additional utility
given the variables recruited before, via $\chi^2$-tests. The first 10 screening
problems are just the univariate screening. After that, starting from bivariate
screening, we examine the variables given those selected so far. Suppose
that we are examining the variables {1,2}. The testing problem depends on
how variables {1, 2} are selected in the previous steps. For example, if variables
{1, 2, 4, 6} have already been selected in the univariate screening, there
is no new recruitment and we move on to examine the submodel {1,7}. If
the variables {1,4,6} have been recruited so far, we need to test if variable
{2} has additional contributions given variable {1}. If the variables {4,6}
have been recruited in the previous steps, we will examine whether variables
{1,2} together have any significant contributions. Therefore, we
never run a regression with more than two variables. Similarly, for trivariate
screening, we never run a regression with more than 3 variables. Clearly,
multivariate screening improves on marginal screening in that it gives significant
variables a chance to be recruited if they are wrongly excluded by the
marginal method.

We now formally describe the procedure. The PS-step contains T sub-stages,
where we screen $\mathcal{G}_t$ sequentially, $t = 1, 2, \ldots, T$. Let $\mathcal{U}^{(t)}$ be the set
of retained indices at the end of stage t, with $\mathcal{U}^{(0)} = \emptyset$ as the convention.
For $1 \le t \le T$, the t-th sub-stage contains two sub-steps.

• (Initial step). Let $N = \mathcal{U}^{(t-1)} \cap \mathcal{G}_t$ represent the set of nodes in $\mathcal{G}_t$ that
have already been accepted by the end of the (t-1)-th sub-stage, and
let $F = \mathcal{G}_t \setminus N$ be the set of other nodes in $\mathcal{G}_t$.

• (Updating step). Write for short $\mathcal{I} = \mathcal{G}_t$. Fixing a tuning parameter $\ell^{ps}$
for patching, introduce

(1.15) $W = (B^{\mathcal{I}^{ps},\mathcal{I}})' (H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1} d^{\mathcal{I}^{ps}}, \quad Q = (B^{\mathcal{I}^{ps},\mathcal{I}})' (H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1} (B^{\mathcal{I}^{ps},\mathcal{I}}),$

where W is a random vector and Q can be thought of as the covariance
matrix of W. Define $W_N$, a subvector of W, and $Q_{N,N}$, a submatrix
of Q, as follows:

(1.16) $W_N = (B^{\mathcal{I}^{ps},N})' (H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1} d^{\mathcal{I}^{ps}}, \quad Q_{N,N} = (B^{\mathcal{I}^{ps},N})' (H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1} (B^{\mathcal{I}^{ps},N}).$

Introduce the test statistic

(1.17) $T(d, F, N) = W' Q^{-1} W - W_N' (Q_{N,N})^{-1} W_N.$

For a threshold $t = t(F, N)$ to be determined, we update the set
of retained nodes by $\mathcal{U}^{(t)} = \mathcal{U}^{(t-1)} \cup F$ if $T(d,F,N) > t$, and let
$\mathcal{U}^{(t)} = \mathcal{U}^{(t-1)}$ otherwise. In other words, we accept nodes in F only
when they have additional utilities.

The PS-step terminates when t = T, at which point we write $\mathcal{U}^* = \mathcal{U}^{(T)}$,
and so

$\mathcal{U}^*$ = the set of all retained indices at the end of the PS-step.

In the PS-step, as we screen, we accept nodes sequentially. Once a node
is accepted in the PS-step, it stays there till the end of the PS-step; of
course, this node could be killed in the PE-step. In spirit, this is similar to
the well-known forward regression method, but the implementations of the two
methods are significantly different.

The PS-step uses a collection of tuning thresholds

$\mathcal{Q} = \{t(F, N) : (F, N) \text{ are defined above}\}.$

A convenient choice for these thresholds is to let $t(F,N) = 2q\log(p)|F|$ for
a properly small fixed constant q > 0. See Section 1.9 (and also Sections
1.10-1.11) for more discussion on the choices of t(F,N).
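
A single updating step of the PS-step can be sketched as follows: the function computes the statistic T(d, F, N) from the patched quantities W and Q and compares it with the threshold $2q\log(p)|F|$. The function name `ps_statistic`, the toy Gram matrix, the signal placement, and the values of $\ell^{ps}$ and q are hypothetical and chosen only to make the sketch runnable; indices are 0-based.

```python
import numpy as np

def ps_statistic(B, H, d, I, N, ell_ps):
    """T(d, F, N) for the node set I = G_t and the already accepted nodes N (a subset of I)."""
    p = len(d)
    I = list(I)
    # Patched set I^ps: all j within distance ell_ps of some i in I.
    I_ps = sorted({j for i in I
                   for j in range(max(0, i - ell_ps), min(p, i + ell_ps + 1))})
    B_sub = B[np.ix_(I_ps, I)]
    Hinv_B = np.linalg.solve(H[np.ix_(I_ps, I_ps)], B_sub)
    W = Hinv_B.T @ d[I_ps]            # W = B' H^{-1} d on the patched set
    Q = B_sub.T @ Hinv_B              # Q = B' H^{-1} B on the patched set
    T = W @ np.linalg.solve(Q, W)
    if len(N) > 0:
        pos = [I.index(k) for k in N]
        W_N, Q_NN = W[pos], Q[np.ix_(pos, pos)]
        T -= W_N @ np.linalg.solve(Q_NN, W_N)
    return T

rng = np.random.default_rng(0)
p = 60
lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
G = (1.0 + lags) ** -0.5                       # hypothetical Gram matrix
D = np.eye(p) - np.eye(p, k=1)                 # order 1 filter, eta = (1, -1)'
B, H = D @ G, D @ G @ D.T

beta = np.zeros(p)
beta[[10, 11]] = 3.0                           # two adjacent toy signals
# d = D(G beta + noise) with noise ~ N(0, G), so that d ~ N(B beta, H).
d = D @ (G @ beta + np.linalg.cholesky(G) @ rng.standard_normal(p))

T_joint = ps_statistic(B, H, d, I=[10, 11], N=[], ell_ps=5)
T_extra = ps_statistic(B, H, d, I=[10, 11], N=[10], ell_ps=5)

q = 0.5                                        # hypothetical threshold constant
accept_F = T_extra > 2 * q * np.log(p) * 1     # here F = {11}, so |F| = 1
```

Because $W_N'(Q_{N,N})^{-1}W_N$ never exceeds $W'Q^{-1}W$ when Q is positive definite, the statistic is nonnegative and measures the additional contribution of the nodes in F beyond those in N.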

How does the PS-step help in variable selection? In Section 2, we show that
in a broad context, provided that the tuning parameters t(F, N) are properly
set, the PS-step has two noteworthy properties: the Sure Screening (SS)
property and the Separable After Screening (SAS) property. The SS property
says that $\mathcal{U}^*$ contains all but a negligible fraction of the true signals. The
SAS property says that if we view $\mathcal{U}^*$ as a subgraph of $\mathcal{G}^*$ (more precisely,
as a subgraph of $\mathcal{G}^+$, an expanded graph of $\mathcal{G}^*$ to be introduced below), then
this subgraph decomposes into many disconnected components, each having
a moderate size.

Together, the SS property and the SAS property enable us to reduce the
original large-scale problem to many parallel small-size regression problems,
and pave the way for the PE-step. See Section 2 for details.

Example 2(b). We illustrate the above points with the toy example in
Example 2(a). Suppose after the PS-step, the set of retained indices $\mathcal{U}^*$ is
{1,4,5,7,8,9}; see Figure 1(b). In this example, we have a total of three
signal nodes, {1}, {4}, and {8}, which are all retained in $\mathcal{U}^*$ and so the
PS-step yields Sure Screening. On the other hand, $\mathcal{U}^*$ contains a few
false positives, which will be further cleaned in the PE-step. At the same
time, viewing it as a subgraph of $\mathcal{G}^*$, $\mathcal{U}^*$ decomposes into two disconnected
components, {1,7,8,9} and {4,5}; compare Figure 1(a). The SS property
and the SAS property enable us to reduce the original problem of 10 nodes
to two parallel regression problems, one with 4 nodes, and the other with 2
nodes.
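
The decomposition in Example 2(b) can be reproduced with a few lines of code. The edge list below is read off from the connected pairs listed in Example 2(a); the function name `induced_components` is a hypothetical helper.

```python
# Edges of the toy GOSD, taken from the connected pairs in Example 2(a).
EDGES = [(1, 2), (1, 7), (2, 4), (3, 4), (4, 5),
         (5, 6), (7, 8), (8, 9), (8, 10), (9, 10)]

def induced_components(edges, retained):
    """Connected components of the subgraph induced by the retained nodes."""
    retained = set(retained)
    nbrs = {v: set() for v in retained}
    for u, v in edges:
        if u in retained and v in retained:
            nbrs[u].add(v)
            nbrs[v].add(u)
    seen, comps = set(), []
    for start in sorted(retained):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in nbrs[u] - seen:   # unvisited retained neighbors
                seen.add(w)
                stack.append(w)
        comps.append(sorted(comp))
    return comps

components = induced_components(EDGES, [1, 4, 5, 7, 8, 9])
```

Each component can then be handed to a separate small regression problem.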

We now discuss the PE-step. Recall that $\ell^{pe}$ is the tuning parameter
for the patching of the PE-step, and let $\{i\}^{pe}$ be as in Definition 1.6. The
following graph can be viewed as an expanded graph of $\mathcal{G}^*$.

Definition 1.7. Let $\mathcal{G}^+ = (V,E)$ be the graph where $V = \{1,2,\ldots,p\}$
and there is an edge between nodes i and j when there exist nodes $k \in \{i\}^{pe}$
and $k' \in \{j\}^{pe}$ such that there is an edge between k and k' in $\mathcal{G}^*$.

Recall that $\mathcal{U}^*$ is the set of retained indices at the end of the PS-step.

Definition 1.8. Fix a graph $\mathcal{G}$ and its subgraph $\mathcal{I}$. We say $\mathcal{I} \unlhd \mathcal{G}$ if
$\mathcal{I}$ is a connected subgraph of $\mathcal{G}$, and $\mathcal{I} \lhd \mathcal{G}$ if $\mathcal{I}$ is a component (maximal
connected subgraph) of $\mathcal{G}$.

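Fix $1 \le j \le p$. When $j \notin \mathcal{U}^*$, CASE estimates $\beta_j$ as 0. When $j \in \mathcal{U}^*$,
viewing $\mathcal{U}^*$ as a subgraph of $\mathcal{G}^+$, there is a unique subgraph $\mathcal{I}$ such that
$j \in \mathcal{I} \lhd \mathcal{U}^*$. Fix two tuning parameters $u^{pe}$ and $v^{pe}$. We estimate $\beta^{\mathcal{I}}$ by
minimizing

(1.18) $\min_{\theta} \Big\{ \frac{1}{2} (d^{\mathcal{I}^{pe}} - B^{\mathcal{I}^{pe},\mathcal{I}} \theta)' (H^{\mathcal{I}^{pe},\mathcal{I}^{pe}})^{-1} (d^{\mathcal{I}^{pe}} - B^{\mathcal{I}^{pe},\mathcal{I}} \theta) + \frac{(u^{pe})^2}{2} \|\theta\|_0 \Big\},$

where $\theta$ is an $|\mathcal{I}| \times 1$ vector where each nonzero coordinate is $\ge v^{pe}$ in magnitude, and $\|\theta\|_0$
denotes the $L^0$-norm of $\theta$. Putting these together gives the final estimator of
CASE, which we denote by $\hat\beta^{case} = \hat\beta^{case}(Y; \delta, m, \mathcal{Q}, \ell^{ps}, \ell^{pe}, u^{pe}, v^{pe}, D_{h,\eta}, X, p)$.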
CASE uses tuning parameters $(\delta, m, \mathcal{Q}, \ell^{ps}, \ell^{pe}, u^{pe}, v^{pe})$. Earlier in this
paper, we have briefly discussed how to choose $(\delta, \mathcal{Q})$. As for m, usually, a
choice of m = 3 is sufficient unless the signals are relatively 'dense'. The
choices of $(\ell^{ps}, \ell^{pe}, u^{pe}, v^{pe})$ are addressed in Section 1.9 (see also Sections
1.10-1.11).

1.6. Computational complexity of CASE, comparison with multivariate
screening. The PS-step is closely related to the well-known method of
marginal screening, and has a moderate computational complexity.

Marginal screening selects variables by thresholding the vector d coordinate-wise.
The method is computationally fast, but it neglects 'local' graphical
structures, and is thus ineffective. For this reason, in many challenging problems,
it is desirable to use multivariate screening methods which adapt to
'local' graphical structures.

Fix m > 1. An m-variate $\chi^2$-screening procedure is one such
method. The method screens all k-tuples of coordinates of d using a $\chi^2$-test,
for all $k \le m$, in an exhaustive (brute-force) fashion. Seemingly, the method
adapts to 'local' graphical structures and could be much more effective than
marginal screening. However, such a procedure has a computational cost
of $O(p^m)$ (excluding the computation cost for obtaining X'Y from (X,Y);
same below) which is usually not affordable when p is large.

The main computational innovation of the PS-step is to use a graph-assisted
m-variate $\chi^2$-screening, which is both effective in variable selection
and efficient in computation. In fact, the PS-step only screens k-tuples of
coordinates of d that form a connected subgraph of $\mathcal{G}^*$, for all $k \le m$. Therefore,
if $\mathcal{G}^*$ is K-sparse, then there are $\le Cp(eK)^{m+1}$ connected subgraphs
of $\mathcal{G}^*$ with size $\le m$; so if $K = K_p$ is no greater than a multi-$\log(p)$ term
(see Definition 1.10), then the computational complexity of the PS-step is
only $O(p)$, up to a multi-$\log(p)$ term.

Example 2(c). We illustrate the difference between the above three
methods with the toy example in Example 2(a), where p = 10 and the GOSD
is displayed in Figure 1(a). Suppose we choose m = 3. Marginal screening
screens all 10 single nodes of the GOSD. The brute-force m-variate screening
screens all k-tuples of indices, $1 \le k \le m$, with a total of $\binom{p}{1} + \cdots + \binom{p}{m} = 175$
such k-tuples. The m-variate screening in the PS-step only screens k-tuples
that are connected subgraphs of $\mathcal{G}^*$, for $1 \le k \le m$, and in this example, we
only have 30 such connected subgraphs.
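
The counts in Example 2(c) can be verified by enumerating the connected subgraphs of the toy GOSD directly. In the sketch below, the edge list is read off from the connected pairs in Example 2(a), and `connected_subgraphs` is a hypothetical helper that grows connected node sets one neighbor at a time; this is one simple way to enumerate them, not necessarily the way the authors implement it.

```python
from math import comb

# Edges of the toy GOSD, taken from the connected pairs in Example 2(a).
EDGES = [(1, 2), (1, 7), (2, 4), (3, 4), (4, 5),
         (5, 6), (7, 8), (8, 9), (8, 10), (9, 10)]

def connected_subgraphs(p, edges, m):
    """All connected node sets of size <= m, sorted by size and then lexicographically."""
    nbrs = {v: set() for v in range(1, p + 1)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    found = {frozenset([v]) for v in nbrs}
    frontier = set(found)
    for _ in range(m - 1):
        # Extend every connected set by one node adjacent to it.
        frontier = {s | {w} for s in frontier for v in s for w in nbrs[v] if w not in s}
        found |= frontier
    return sorted((sorted(s) for s in found), key=lambda s: (len(s), s))

p, m = 10, 3
graph_guided = connected_subgraphs(p, EDGES, m)
brute_force_count = sum(comb(p, k) for k in range(1, m + 1))
```

Here `graph_guided` lists the submodels in the same order as in Example 2(a), and `brute_force_count` is the number of submodels an exhaustive m-variate screening would visit.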

The computational complexity of the PE-step consists of two parts. The first
part is the complexity of obtaining all components of $\mathcal{U}^*$, which is $O(pK)$,
where K is the maximum degree of $\mathcal{G}^+$; note that for settings considered
in this paper, $K = K_p$ does not exceed a multi-$\log(p)$ term (see Lemma 5.2).
The second part of the complexity comes from solving (1.18), which hinges
on the maximal size of $\mathcal{I}$. In Lemma 2.2, we show that in a broad context,
the maximal size of $\mathcal{I}$ does not exceed a constant $l_0$, provided the thresholds
$\mathcal{Q}$ are properly set. Numerical studies in Section 3 also support this point.
Therefore, the complexity in this part does not exceed $p \cdot 3^{l_0}$. As a result,
the computational complexity of the PE-step is moderate. Here, the bound
$O(pK + p \cdot 3^{l_0})$ is conservative; the actual computational complexity is much
smaller than this.

How does CASE perform? In Sections 1.7-1.9, we set up an asymptotic
framework and show that CASE is asymptotically minimax in terms of the
Hamming distance over a wide class of situations. In Sections 1.10-1.11, we
apply CASE to the long-memory time series and the change-point model,
and elaborate on the optimality of CASE in such models with the so-called
phase diagram.

1.7. Asymptotic Rare and Weak model. In this section, we add an asymptotic
framework to the Rare and Weak signal model $RW(\epsilon, \tau, \mu)$ introduced
in Section 1.1. We use p as the driving asymptotic parameter and tie $(\epsilon, \tau)$
to p through some fixed parameters.

In particular, we fix $\vartheta \in (0, 1)$ and model the sparse parameter $\epsilon$ by

(1.19) $\epsilon = \epsilon_p = p^{-\vartheta}$.

Note that as p grows, the signal becomes increasingly sparse. At this sparsity
level, it turns out that the most interesting range of signal strength is
$\tau = O(\sqrt{\log(p)})$. For much smaller $\tau$, successful recovery is impossible. For much
larger $\tau$, the problem is relatively easy. In light of this, we fix r > 0 and let

(1.20) $\tau = \tau_p = \sqrt{2r\log(p)}$.

At the same time, recall that in $RW(\epsilon, \tau, \mu)$, we require $\mu \in \Theta_p(\tau)$ so
that $|\mu_i| \ge \tau$ for all $1 \le i \le p$. Fixing a > 1, we now further restrict $\mu$ to
the following subset of $\Theta_p(\tau)$:

(1.21) $\Theta_p^*(\tau_p, a) = \{\mu \in \Theta_p(\tau_p):\ \tau_p \le |\mu_i| \le a\tau_p,\ 1 \le i \le p\}$.

Definition 1.9. We call (1.19)-(1.21) the Asymptotic Rare and Weak
model $ARW(\vartheta, r, a, \mu)$.
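As an illustration of (1.19)-(1.21), the following sketch draws one coefficient vector of the form $\beta = b \circ \mu$, where the entries of b are independent Bernoulli($\epsilon_p$) variables. The model only requires $\mu \in \Theta_p^*(\tau_p, a)$; drawing the magnitudes uniformly on $[\tau_p, a\tau_p]$ with random signs, as well as the function name `simulate_arw`, are assumptions made for this example.

```python
import numpy as np


def simulate_arw(p, theta, r, a=2.0, seed=0):
    """Draw beta = b * mu under an ARW-type model (illustrative choices for mu)."""
    rng = np.random.default_rng(seed)
    eps_p = p ** (-theta)                      # sparsity level, as in (1.19)
    tau_p = np.sqrt(2.0 * r * np.log(p))       # minimal signal strength, as in (1.20)
    b = rng.random(p) < eps_p                  # Bernoulli(eps_p) support indicators
    # Assumption: magnitudes uniform on [tau_p, a * tau_p], signs +/-1 with equal odds.
    magnitude = rng.uniform(tau_p, a * tau_p, size=p)
    sign = rng.choice([-1.0, 1.0], size=p)
    beta = np.where(b, sign * magnitude, 0.0)
    return beta, eps_p, tau_p


beta, eps_p, tau_p = simulate_arw(p=10_000, theta=0.5, r=1.0, a=2.0, seed=1)
n_signals = int(np.count_nonzero(beta))        # expected to be around p * eps_p = 100
```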

Requiring the strength of each signal to be $\le a\tau_p$ is mainly for technical reasons,
and hopefully, such a constraint can be removed in the near future. From a
practical point of view, since usually we do not have sufficient information
on $\mu$, we prefer to have a larger a: we hope that when a is properly large,
$\Theta_p^*(\tau_p, a)$ is broad enough, so that neither the optimal procedure nor the
minimax risk needs to adapt to a.

Towards this end, we impose some mild regularity conditions on a and
the Gram matrix G. Let g be the smallest integer such that

(1.22) $g \ge \max\{(\vartheta + r)^2/(2\vartheta r),\ m\}$.

For any p x p Gram matrix G and $1 \le k \le p$, let $\lambda_k^*(G)$ be the minimum of
the smallest eigenvalues of all k x k principal sub-matrices of G. Introduce

(1.23) $\mathcal{M}_p(c_0, g) = \{G \text{ is a } p \times p \text{ Gram matrix},\ \lambda_k^*(G) \ge c_0,\ 1 \le k \le g\}$.

For any two subsets $V_0$ and $V_1$ of $\{1, 2, \ldots, p\}$, consider the optimization
problem

$(\theta_*^{(0)}(V_0, V_1; G),\ \theta_*^{(1)}(V_0, V_1; G)) = \mathrm{argmin}\{(\theta^{(1)} - \theta^{(0)})' G (\theta^{(1)} - \theta^{(0)})\}$,

subject to the constraints that $|\theta_i^{(k)}| \ge \tau_p$ if $i \in V_k$ and $\theta_i^{(k)} = 0$ otherwise, where
k = 0, 1, and that in the special case of $V_0 = V_1$, the sign vectors of $\theta^{(0)}$ and
$\theta^{(1)}$ are unequal. Introduce

$a_g^*(G) = \max_{\{(V_0, V_1):\ |V_0 \cup V_1| \le g\}} \max\{\|\theta_*^{(0)}(V_0, V_1; G)\|_\infty,\ \|\theta_*^{(1)}(V_0, V_1; G)\|_\infty\}$.

The following lemma is elementary, so we omit the proof.

Lemma 1.3. For any $G \in \mathcal{M}_p(c_0, g)$, there is a constant $C = C(c_0, g) > 0$
such that $a_g^*(G) \le C$.

In this paper, except for Section 1.11 where we discuss the change-point
model, we assume

(1.24) $G \in \mathcal{M}_p(c_0, g)$, $a > a_g^*(G)$.

Under such conditions, $\Theta_p^*(\tau_p, a)$ is broad enough and the minimax risk (to be
introduced below) does not depend on a. See Section 1.8 for more discussion.

For any variable selection procedure $\hat\beta$, we measure the performance by
the Hamming distance

$h_p(\hat\beta; \beta, G) = E\Big[\sum_{j=1}^p 1\{\mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j)\}\Big]$,

where the expectation is taken with respect to $\hat\beta$. Here, for any p x 1 vector
$\xi$, $\mathrm{sgn}(\xi)$ denotes the sign vector (for any number x, sgn(x) = -1, 0, 1 when
x < 0, x = 0, and x > 0 correspondingly).
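For a single realization, the quantity inside the expectation is simply the number of coordinates whose signs disagree. A minimal sketch (the function name `sign_hamming` is illustrative, not from the paper):

```python
import numpy as np


def sign_hamming(beta_hat, beta):
    """Count coordinates j where sgn(beta_hat_j) differs from sgn(beta_j)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # np.sign maps negative, zero, positive entries to -1, 0, 1 respectively.
    return int(np.sum(np.sign(beta_hat) != np.sign(beta)))


beta_true = [0.0, 2.0, -1.5, 0.0]
beta_est = [0.3, 1.0, 1.5, 0.0]   # one false positive and one sign error
errors = sign_hamming(beta_est, beta_true)
```

Both a false positive (coordinate 1) and a sign flip (coordinate 3) count as one error each; averaging this count over repeated draws of the data would approximate $h_p$.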

Under $ARW(\vartheta, r, a, \mu)$, $\beta = b \circ \mu$, so the overall Hamming distance is

$H_p(\hat\beta; \epsilon_p, \mu, G) = E_{\epsilon_p}[h_p(\hat\beta; \beta, G)]$,

where $E_{\epsilon_p}$ is the expectation with respect to the law of b. Finally, the
minimax Hamming distance under $ARW(\vartheta, r, a, \mu)$ is

$\mathrm{Hamm}_p^*(\vartheta, r, a, G) = \inf_{\hat\beta} \sup_{\mu \in \Theta_p^*(\tau_p, a)} H_p(\hat\beta; \epsilon_p, \mu, G)$.

In the next section, we will see that the minimax Hamming distance does not
depend on a as long as (1.24) holds.

In many recent works, the probability of exact support recovery or the oracle
property is used to assess optimality, e.g. [9, 35]. However, when signals are
rare and weak, exact support recovery is usually impossible, and the Hamming
distance is a more appropriate criterion for assessing optimality. In
comparison, the study of the minimax Hamming distance is not only
mathematically more demanding but also scientifically more relevant than that of
the oracle property.

1.8. Lower bound for the minimax Hamming distance. We view the (global)
Hamming distance as the aggregation of 'local' Hamming distances. To
construct a lower bound for the (global) minimax Hamming distance, the key
is to construct lower bounds for 'local' Hamming errors. Fix $1 \le j \le p$. The
'local' Hamming error at index j is the risk we make among the neighboring
indices of j in GOSD, say, $\{k: d(j, k) \le g\}$, where g is as in (1.22) and
d(j, k) is the geodesic distance between j and k in the GOSD. The lower
bound for such a 'local' Hamming error is characterized by an exponent $\rho_j^*$,
which we now introduce.

For any subset $V \subset \{1, 2, \ldots, p\}$, let $I_V$ be the p x 1 vector such that the
j-th coordinate is 1 if $j \in V$ and 0 otherwise. Fixing two subsets $V_0$ and $V_1$
of $\{1, 2, \ldots, p\}$, introduce



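(1.25) $\min_{\{\theta^{(k)} = I_{V_k} \circ \mu^{(k)}:\ \mu^{(k)} \in \Theta_p^*(\tau_p, a),\ k = 0, 1,\ \mathrm{sgn}(\theta^{(0)}) \ne \mathrm{sgn}(\theta^{(1)})\}}$

and

(1.26)

The exponent $\rho_j^* = \rho_j^*(\vartheta, r, a, G)$ is defined by

(1.27) $\rho_j^*(\vartheta, r, a, G) = \min_{(V_0, V_1):\ j \in V_0 \cup V_1} \rho(V_0, V_1)$.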

The following notation $L_p$ is frequently used in this paper.

Definition 1.10. $L_p$, as a positive sequence indexed by p, is called a
multi-log(p) term if for any fixed $\delta > 0$, $\lim_{p \to \infty} L_p p^{\delta} = \infty$ and
$\lim_{p \to \infty} L_p p^{-\delta} = 0$.

It can be shown that $L_p p^{-\rho_j^*}$ provides a lower bound for the 'local' minimax
Hamming distance at index j, and that when (1.24) holds, $\rho_j^*(\vartheta, r, a, G)$
does not depend on a; see [20, Section 1.5] for details. In the remaining part
of the paper, we will write it as $\rho_j^*(\vartheta, r, G)$ for short.

At the same time, in order for the aggregation of all lower bounds for
'local' Hamming errors to give a lower bound for the 'global' Hamming
distance, we need to introduce the Graph of Least Favorables (GOLF). Towards
this end, recalling g and $\rho(V_0, V_1)$ as in (1.22) and (1.26), respectively, let

$(V_{0j}^*, V_{1j}^*) = \mathrm{argmin}_{\{(V_0, V_1):\ j \in V_0 \cup V_1,\ |V_0 \cup V_1| \le g\}}\ \rho(V_0, V_1)$,

and when there is a tie, pick the one that appears first lexicographically. We
can think of $(V_{0j}^*, V_{1j}^*)$ as the 'least favorable' configuration at index j; see [20,
Section 1.5] for details.

Definition 1.11. GOLF is the graph $\mathcal{G}^\diamond = (V, E)$ where $V = \{1, 2, \ldots, p\}$
and there is an edge between j and k if and only if $(V_{0j}^* \cup V_{1j}^*) \cap (V_{0k}^* \cup V_{1k}^*) \ne \emptyset$.

The following theorem is similar to [20, Theorem 1.1] so we omit the
proof.

Theorem 1.1. Suppose (1.24) holds so that $\rho_j^*(\vartheta, r, a, G)$ does not depend
on the parameter a for sufficiently large p. As $p \to \infty$,

$\mathrm{Hamm}_p^*(\vartheta, r, a, G) \ge L_p [d_p(\mathcal{G}^\diamond)]^{-1} \sum_{j=1}^p p^{-\rho_j^*(\vartheta, r, G)}$,

where $d_p(\mathcal{G}^\diamond)$ is the maximum degree of all nodes in $\mathcal{G}^\diamond$.

In many examples, including those of primary interest of this paper,

(1.28) $d_p(\mathcal{G}^\diamond) \le L_p$.

In such cases, we have the following lower bound:

(1.29) $\mathrm{Hamm}_p^*(\vartheta, r, a, G) \ge L_p \sum_{j=1}^p p^{-\rho_j^*(\vartheta, r, G)}$.

1.9. Main results. In this section, we show that in a broad context, provided
the tuning parameters are properly set, CASE achieves the lower
bound prescribed in Theorem 1.1, up to some $L_p$ terms. Therefore, the
lower bound in Theorem 1.1 is tight, and CASE achieves the optimal rate
of convergence.

For a given $\gamma > 0$, we focus on linear models with the Gram matrix from

$\mathcal{M}_p^*(\gamma, g, c_0, A_1) = \mathcal{M}_p(c_0, g) \cap \mathcal{M}_p(\gamma, A_1)$,

where we recall that the two terms on the right hand side are defined in
(1.23) and (1.7), respectively. The following lemma is proved in Section 5.

Lemma 1.4. For $G \in \mathcal{M}_p^*(\gamma, g, c_0, A_1)$, $d_p(\mathcal{G}^\diamond) \le L_p$. As a result,
$\mathrm{Hamm}_p^*(\vartheta, r, a, G) \ge L_p \sum_{j=1}^p p^{-\rho_j^*(\vartheta, r, G)}$.

For any linear filter $D = D_{h,\eta}$, let

$\varphi_\eta(z) = 1 + \eta_1 z + \ldots + \eta_h z^h$

be the so-called characterization polynomial. We assume the following
regularity conditions.

• Regularization Condition A (RCA). For any root $z_0$ of $\varphi_\eta(z)$, $|z_0| \ge 1$.

• Regularization Condition B (RCB). There are constants $\kappa > 0$ and
$c_1 > 0$ such that $\lambda_k^*(DGD') \ge c_1 k^{-\kappa}$ (see Section 1.7 for the definition
of $\lambda_k^*$).
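RCA can be checked numerically for a given filter. The sketch below assumes that the coefficient vector is passed in increasing power order with leading coefficient 1, so that $(1, -1)$ encodes $1 - z$; the function name `satisfies_rca` and the tolerance are illustrative choices, the latter absorbing rounding error for roots that sit exactly on the unit circle.

```python
import numpy as np


def satisfies_rca(eta, tol=1e-6):
    """Check that every root z0 of 1 + eta_1 z + ... + eta_h z^h has |z0| >= 1.

    `eta` lists the coefficients in increasing power order, e.g. (1, -1) for 1 - z.
    """
    eta = np.asarray(eta, dtype=float)
    # np.roots expects the highest-degree coefficient first, hence the reversal.
    roots = np.roots(eta[::-1])
    return bool(np.all(np.abs(roots) >= 1.0 - tol))


adjacent = satisfies_rca([1, -1])            # first-order adjacent difference
second = satisfies_rca([1, -2, 1])           # second-order difference, double root at 1
seasonal = satisfies_rca([1, 0, 0, 0, -1])   # seasonal difference with period 4
violating = satisfies_rca([1, -2])           # root at z = 1/2, inside the unit disc
```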

For many well-known linear filters such as adjacent differences, seasonal
differences, etc., RCA is satisfied. Also, RCB is only a mild condition since
$\kappa$ can be any positive number. For example, RCB holds in the change-point
model and long-memory time series model with certain D matrices.
In general, $\kappa$ is not 0 because when DG is sparse, DGD' is very likely to
be approximately singular and the associated value of $\lambda_k^*$ can be small when
k is large. This is true even for very simple G (e.g. $G = I_p$, $D = D_{1,\eta}$ and
$\eta = (1, -1)'$).
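The last remark can be seen numerically. The sketch below takes G to be the identity and D a first-difference matrix, and looks at one k x k block of DD' (rows built from k consecutive differences), which is the tridiagonal matrix with 2 on the diagonal and -1 next to it. Its smallest eigenvalue is $2 - 2\cos(\pi/(k+1))$, of order $k^{-2}$. Since $\lambda_k^*$ is a minimum over all k x k principal sub-matrices, it can be no larger than this value. The rectangular k x (k+1) form of D and the function name are assumptions made to keep the example self-contained.

```python
import numpy as np


def min_eig_first_difference(k):
    """Smallest eigenvalue of D D' for the k x (k+1) first-difference matrix D (G = I)."""
    D = np.eye(k, k + 1) - np.eye(k, k + 1, k=1)   # row i is e_i - e_{i+1}
    H = D @ D.T                                     # tridiagonal: 2 on diagonal, -1 off
    return float(np.linalg.eigvalsh(H)[0])          # eigvalsh returns ascending order


sizes = (5, 20, 80)
smallest = [min_eig_first_difference(k) for k in sizes]
closed_form = [2 - 2 * np.cos(np.pi / (k + 1)) for k in sizes]
```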

At the same time, these conditions can be further relaxed. For example,
for the change-point problem, the Gram matrix has barely any off-diagonal
decay, and does not belong to $\mathcal{M}_p^*$. Nevertheless, with slight modification in
the procedure, the main results continue to hold.

CASE uses tuning parameters $(\delta, m, \mathcal{Q}, \ell^{ps}, \ell^{pe}, u^{pe}, v^{pe})$. The choice of $\delta$
is flexible, and we usually set $\delta = 1/\log(p)$. For the main theorem below,
we treat m as given. In practice, taking m to be a small integer (say, $\le 3$) is
usually sufficient, unless the signals are relatively dense (say, $\vartheta < 1/4$). The
choices of $\ell^{ps}$ and $\ell^{pe}$ are also relatively flexible, and letting $\ell^{ps}$ be a sufficiently
large constant and $\ell^{pe}$ be $(\log(p))^{\nu}$ for some constant $\nu < (1 - 1/\alpha)/(\kappa + 1/2)$
is sufficient, where $\alpha$ is as in Definition 1.2, and $\kappa$ is as in RCB.

At the same time, in principle, the optimal choices of $(u^{pe}, v^{pe})$ are

(1.30) $u^{pe} = \sqrt{2\vartheta\log p}$, $v^{pe} = \sqrt{2r\log p}$,

which depend on the underlying parameters $(\vartheta, r)$ that are unknown to us.
Despite this, our numeric studies in Section 3 suggest that the choices of
$(u^{pe}, v^{pe})$ are relatively flexible; see Sections 3-4 for more discussions.

Last, we discuss how to choose $\mathcal{Q} = \{t(F, N):\ (F, N) \text{ are defined as in the PS-step}\}$.
Let $t(F, N) = 2q\log(p)$, where q > 0 is a constant. It turns out
that the main result (Theorem 1.2 below) holds as long as

(1.31) $q_0 \le q \le q^*(F, N)$,

where $q_0 > 0$ is an appropriately small constant, and for any subsets (F, N),

(1.32) $q^*(F, N) = \max\Big\{q:\ (|F| + |N|)\vartheta + \big[\big(\sqrt{\tilde\omega(F, N) r} - \sqrt{q|F|}\big)_+\big]^2 \ge \psi(F, N)\Big\}$;

here,

(1.33) $\psi(F, N) = \frac{(|F| + 2|N|)\vartheta}{2} + \begin{cases} \frac{1}{4}\omega(F, N) r, & |F| \text{ is even}, \\ \frac{\vartheta}{2} + \frac{1}{4}\big[\big(\sqrt{\omega(F, N) r} - \vartheta/\sqrt{\omega(F, N) r}\big)_+\big]^2, & |F| \text{ is odd}, \end{cases}$

with

(1.34) $\omega(F, N) = \min_{\xi \in R^{|F|}:\ |\xi_i| \ge 1} \xi'[G^{F,F} - G^{F,N}(G^{N,N})^{-1}G^{N,F}]\xi$,

and

(1.35) $\tilde\omega(F, N) = \min_{\xi \in R^{|F|}:\ |\xi_i| \ge 1} \xi'[Q_{F,F} - Q_{F,N}(Q_{N,N})^{-1}Q_{N,F}]\xi$,

where $Q_{F,N} = (B^{\mathcal{I}^{ps},F})'(H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1}(B^{\mathcal{I}^{ps},N})$ with $\mathcal{I} = F \cup N$, and $Q_{N,F}$,
$Q_{F,F}$ and $Q_{N,N}$ are defined similarly. Compared to (1.15), we see that $Q_{F,F}$,
$Q_{F,N}$, $Q_{N,F}$ and $Q_{N,N}$ are all submatrices of Q. Hence, $\tilde\omega(F, N)$ can be
viewed as a counterpart of $\omega(F, N)$ by replacing the submatrices of $G^{\mathcal{I},\mathcal{I}}$ by
the corresponding ones of Q.

From a practical point of view, there is a trade-off in choosing q: a larger q
would increase the number of Type II errors in the PS-step, but would also
reduce the computation cost in the PE-step. The following is a convenient
choice which we recommend in this paper:

(1.36) $t(F, N) = 2q|F|\log(p)$,

where $0 < q < c_0 r/4$ is a constant and $c_0$ is as in $\mathcal{M}_p^*(\gamma, g, c_0, A_1)$.
We are now ready for the main result of this paper. 

Theorem 1.2. Suppose that for sufficiently large p, $G \in \mathcal{M}_p^*(\gamma, g, c_0, A_1)$,
$D_{h,\eta}G \in \mathcal{M}_p(\alpha, A_0)$ with $\alpha > 1$, and that RCA-RCB hold. Consider
$\hat\beta^{case} = \hat\beta^{case}(Y; \delta, m, \mathcal{Q}, \ell^{ps}, \ell^{pe}, u^{pe}, v^{pe}, D_{h,\eta}, X, p)$ with the tuning parameters
specified above. Then as $p \to \infty$,

(1.37) $\sup_{\mu \in \Theta_p^*(\tau_p, a)} H_p(\hat\beta^{case}; \epsilon_p, \mu, G) \le L_p\Big[p^{1 - (m+1)\vartheta} + \sum_{j=1}^p p^{-\rho_j^*(\vartheta, r, G)}\Big] + o(1)$.

Combine Lemma 1.4 and Theorem 1.2. Given that the parameter m is
appropriately large, both the upper bound and the lower bound are tight and
CASE achieves the optimal rate of convergence prescribed by

(1.38) $\mathrm{Hamm}_p^*(\vartheta, r, a, G) = L_p \sum_{j=1}^p p^{-\rho_j^*(\vartheta, r, G)} + o(1)$.

Theorem 1.2 is proved in Section 2, where we explain the key idea behind 
the procedure, as well as the selection of the tuning parameters. 

1.10. Application to the long-memory time series model. The long-memory
time series model in Section 1 can be written as a regression model:

$Y = X\beta + z$, $z \sim N(0, I_n)$,

where the Gram matrix G is asymptotically Toeplitz and has slow off-diagonal
decays. Without loss of generality, we consider the idealized case
where G is an exact Toeplitz matrix generated by a spectral density f:

$G(i, j) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \cos(|i - j|\omega) f(\omega)\, d\omega$, $1 \le i, j \le p$.

In the literature [6, 23], the spectral density for a long-memory process is
usually characterized as

(1.39) $f(\omega) = |1 - e^{\sqrt{-1}\omega}|^{-2\phi} f^*(\omega)$,

where $\phi \in (0, 1/2)$ is the long-memory parameter, and $f^*(\omega)$ is a positive
symmetric function that is continuous on $[-\pi, \pi]$ and is twice differentiable
except at $\omega = 0$.

In this model, the Gram matrix is non-sparse but it is sparsifiable. To
see the point, let $\eta = (1, -1)'$ and let $D = D_{1,\eta}$ be the first-order adjacent
row-differencing. On one hand, since the spectral density f is singular at the
origin, it follows from the Fourier analysis that

$|G(i, j)| \ge C(1 + |i - j|)^{-(1 - 2\phi)}$,

and hence G is non-sparse. On the other hand, it is seen that

r\j-i\+i -- 
B(i,j) = ^l uf(u))(\)d\, 

J\3-i\ 

where we recall that B = DG and note that $\hat{g}$ denotes the Fourier transform
of g. Compared to $f(\omega)$, $\omega f(\omega)$ is non-singular at the origin. Additionally,
it is seen that $B \in \mathcal{M}_p(2 - 2\phi, A)$, where $2 - 2\phi > 1$, so B is sparse (a similar
claim applies to H = DGD'). This shows that G is sparsifiable by adjacent
row-differencing.
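The effect of row-differencing can be seen on a concrete long-memory Gram matrix. The sketch below uses the FARIMA(0, $\phi$, 0) autocorrelation, computed by the standard recursion $\rho(k) = \rho(k-1)(k - 1 + \phi)/(k - \phi)$ with $\rho(0) = 1$, as an illustrative choice of G. The boundary convention for D (row i minus row i+1, last row kept unchanged) and the function names are assumptions of this example.

```python
import numpy as np


def farima_autocorr(p, phi):
    """Autocorrelations rho(0..p-1) of a FARIMA(0, phi, 0) process, 0 < phi < 1/2."""
    rho = np.ones(p)
    for k in range(1, p):
        rho[k] = rho[k - 1] * (k - 1 + phi) / (k - phi)
    return rho


def toeplitz_gram(rho):
    """Toeplitz matrix with G(i, j) = rho(|i - j|)."""
    idx = np.arange(len(rho))
    return rho[np.abs(idx[:, None] - idx[None, :])]


def first_difference(p):
    """Assumed form of the first-order adjacent row-differencing filter."""
    return np.eye(p) - np.eye(p, k=1)


p, phi = 200, 0.35
G = toeplitz_gram(farima_autocorr(p, phi))
B = first_difference(p) @ G
# Compare off-diagonal entries 50 positions away from the diagonal.
g_far, b_far = abs(G[0, 50]), abs(B[0, 50])
```

The entries of G decay like $|i - j|^{-(1 - 2\phi)}$, whereas those of B decay one power faster, which is what makes B effectively sparse.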

In this example, there is a function $\rho_{lts}^*(\vartheta, r; f)$ that only depends on
$(\vartheta, r, f)$ such that

$\max_{\{j:\ \log(p) \le j \le p - \log(p)\}} |\rho_j^*(\vartheta, r, G) - \rho_{lts}^*(\vartheta, r; f)| \to 0$, as $p \to \infty$,

where the subscript 'lts' stands for long-memory time series. The following
theorem can be derived from Theorem 1.2, and is proved in Section 5.

Theorem 1.3. For a long-memory time series model where $|(f^*)''(\omega)| \le C|\omega|^{-2}$,
the minimax Hamming distance satisfies $\mathrm{Hamm}_p^*(\vartheta, r, G) = L_p p^{1 - \rho_{lts}^*(\vartheta, r; f)}$.
If we apply CASE where $(m + 1)\vartheta \ge \rho_{lts}^*(\vartheta, r; f)$, $\eta = (1, -1)'$, and the tuning
parameters are as in Section 1.9, then

$\sup_{\mu \in \Theta_p^*(\tau_p, a)} H_p(\hat\beta^{case}; \epsilon_p, \mu, G) \le L_p p^{1 - \rho_{lts}^*(\vartheta, r; f)} + o(1)$.

Theorem 1.3 can be interpreted by the so-called phase diagram. The phase
diagram is a way to visualize the class of settings where the signals are so
rare and weak that successful variable selection is simply impossible [19]. In
detail, for a spectral density f and $\vartheta \in (0, 1)$, let

$r_{lts}^*(\vartheta) = r_{lts}^*(\vartheta; f)$

be the unique solution of $\rho_{lts}^*(\vartheta, r; f) = 1$. Note that $r = r_{lts}^*(\vartheta)$ characterizes
the minimal signal strength required for exact support recovery with high
probability. We have the following lemma, which is proved in Section 5.

Lemma 1.5. Under the conditions of Theorem 1.3, if $(f^*)''(0)$ exists,
then $r_{lts}^*(\vartheta; f)$ is a decreasing function in $\vartheta$, with limits 1 and $\frac{2}{\pi}\int_{-\pi}^{\pi} f^{-1}(\omega)\, d\omega$
as $\vartheta \to 1$ and $\vartheta \to 0$, respectively.

Call the two-dimensional space $\{(\vartheta, r):\ 0 < \vartheta < 1,\ r > 0\}$ the phase space.
Interestingly, there is a partition of the phase space as follows.

• Region of No Recovery $\{(\vartheta, r):\ 0 < r < \vartheta,\ 0 < \vartheta < 1\}$. In this region,
the minimax Hamming distance is $\gtrsim p\epsilon_p$, where $p\epsilon_p$ is approximately the
number of signals. In this region, the signals are too rare and weak and
successful variable selection is impossible.

• Region of Almost Full Recovery $\{(\vartheta, r):\ \vartheta < r < r_{lts}^*(\vartheta; f),\ 0 < \vartheta < 1\}$.
In this region, the minimax Hamming distance is much larger than
1 but much smaller than $p\epsilon_p$. Therefore, the optimal procedure can
recover most of the signals but not all of them.

• Region of Exact Recovery $\{(\vartheta, r):\ r > r_{lts}^*(\vartheta; f),\ 0 < \vartheta < 1\}$. In this
region, the minimax Hamming distance is o(1). Therefore, the optimal
procedure recovers all signals with probability $\approx 1$.

Because of the partition of the phase space, we call this the phase diagram. 

From time to time, we wish to have a more explicit formula for the rate
$\rho_{lts}^*(\vartheta, r; f)$ and the critical value $r_{lts}^*(\vartheta; f)$. In general, this is a hard
problem, but both quantities can be computed numerically when f is given. In
Figure 2, we display the phase diagrams for the FARIMA(0, $\phi$, 0) process
where

(1.40) $f^*(\omega) = \frac{\Gamma^2(1 - \phi)}{\Gamma(1 - 2\phi)}$.

Take $\phi = 0.35, 0.25$ for example, $r_{lts}^*(\vartheta; f) \approx 7.14, 5.08$ for small $\vartheta$.
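The two quoted values can be reproduced approximately under an assumption about the small-$\vartheta$ limit in Lemma 1.5. The sketch below assumes that this limit equals $(2/\pi)\int_{-\pi}^{\pi} f^{-1}(\omega)\, d\omega$, with f given by (1.39) and (1.40); the prefactor $2/\pi$ is an assumption chosen because it matches the numbers 7.14 and 5.08, and the function names are illustrative. The integral is evaluated with a plain midpoint rule.

```python
import numpy as np
from math import gamma, pi


def farima_fstar(phi):
    """Constant f* for FARIMA(0, phi, 0), as read from (1.40)."""
    return gamma(1 - phi) ** 2 / gamma(1 - 2 * phi)


def small_theta_limit(phi, n=400_000):
    """Assumed limit (2/pi) * integral over [-pi, pi] of 1/f, via the midpoint rule."""
    h = 2 * pi / n
    w = -pi + (np.arange(n) + 0.5) * h                      # midpoints, never exactly 0
    f_inv = np.abs(1 - np.exp(1j * w)) ** (2 * phi) / farima_fstar(phi)
    return (2 / pi) * float(np.sum(f_inv) * h)


limit_035 = small_theta_limit(0.35)
limit_025 = small_theta_limit(0.25)
```

Under the same assumption, the gamma-function reflection formula gives the closed form $4\tan(\pi\phi)/(\pi\phi)$, which equals $16/\pi$ at $\phi = 0.25$.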



1.11. Application to the change-point model. The change-point model in
the introduction can be viewed as a special case of Model (1.1), where $\beta$ is
as in (1.5), and the Gram matrix satisfies

(1.41) $G(i, j) = \min\{i, j\}$, $1 \le i, j \le p$.

For technical reasons, it is more convenient not to normalize the diagonals
of G to 1.

Fig 2. Phase diagrams corresponding to the FARIMA(0, $\phi$, 0) process. Left: $\phi = 0.35$.
Right: $\phi = 0.25$.

The change-point model can be viewed as an 'extreme' case of what is
studied in this paper. On one hand, the Gram matrix G is 'ill-posed' and
each row of G does not satisfy the condition of off-diagonal decay in Theorem
1.2. On the other hand, G has a very special structure which can be largely
exploited. In fact, if we sparsify G with the linear filter $D = D_{2,\eta}$, where
$\eta = (1, -2, 1)'$, it is seen that $B = DG = I_p$, and H = DGD' is a tri-diagonal
matrix with $H(i, j) = 2 \cdot 1\{i = j\} - 1\{|i - j| = 1\} - 1\{i = j = p\}$, which are
very simple matrices. For these reasons, we modify CASE as follows.

• Due to the simple structure of B, we don't need patching in the PS-step
(i.e., $\ell^{ps} = 0$).

• For the same reason, the choices of thresholds t(F, N) are more flexible
than before, and taking $t(F, N) = 2q\log(p)$ for a proper constant q > 0
works.

• However, since H is 'extreme' (the smallest eigenvalue tends to 0 as
$p \to \infty$), we have to modify the PE-step carefully.
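The identities B = DG = I and the tri-diagonal form of H can be verified directly. The sketch below builds D as the symmetric tri-diagonal matrix with 2 on the diagonal (1 in the last position) and -1 next to it; this sign and boundary convention is an assumption chosen so that DG = I holds exactly, and it may differ from the paper's $D_{2,\eta}$ by such conventions. The function names are illustrative.

```python
import numpy as np


def changepoint_gram(p):
    """Gram matrix of the change-point model, G(i, j) = min(i, j) with 1-based indices."""
    idx = np.arange(1, p + 1)
    return np.minimum(idx[:, None], idx[None, :]).astype(float)


def second_difference_filter(p):
    """Assumed second-order differencing filter: tridiag(-1, 2, -1) with last diagonal 1."""
    D = 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
    D[-1, -1] = 1.0
    return D


def smallest_eig_H(p):
    """Smallest eigenvalue of H = D G D' for problem size p."""
    G, D = changepoint_gram(p), second_difference_filter(p)
    return float(np.linalg.eigvalsh(D @ G @ D.T)[0])


p = 8
G, D = changepoint_gram(p), second_difference_filter(p)
B = D @ G          # equals the identity
H = D @ G @ D.T    # equals D itself, i.e. the tri-diagonal matrix in the text
```

Comparing `smallest_eig_H(8)` with `smallest_eig_H(40)` shows the smallest eigenvalue of H shrinking as p grows, which is the 'extreme' behavior mentioned in the last bullet.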

In detail, the PE-step for the change-point model is as follows. Given $\ell^{pe}$,
let $\mathcal{G}^+$ be as in Definition 1.7. Recall that $\mathcal{U}_p^*$ denotes the set of all retained
indices at the end of the PS-step. Viewing $\mathcal{U}_p^*$ as a subgraph of $\mathcal{G}^+$,
let $\mathcal{I} \unlhd \mathcal{U}_p^*$ be one of its components. The goal is to split $\mathcal{I}$ into N different
subsets

$\mathcal{I} = \mathcal{I}^{(1)} \cup \ldots \cup \mathcal{I}^{(N)}$,

and for each subset $\mathcal{I}^{(k)}$, $1 \le k \le N$, we construct a patched set $\mathcal{I}^{(k),pe}$. We
then estimate $\beta^{\mathcal{I}^{(k)}}$ separately using (1.18). Putting the estimates $\hat\beta^{\mathcal{I}^{(k)}}$ together
gives our estimate of $\beta^{\mathcal{I}}$.

The subsets $\{(\mathcal{I}^{(k)}, \mathcal{I}^{(k),pe})\}_{k=1}^N$ are recursively constructed as follows.
Denote $l = |\mathcal{I}|$, $M = (\ell^{pe}/2)^{1/(l+1)}$, and write

$\mathcal{I} = \{j_1, j_2, \ldots, j_l\}$, $j_1 < j_2 < \ldots < j_l$.

First, letting $k_1$ be the largest index such that $j_{k_1} - j_{k_1 - 1} > \ell^{pe}/M$, define

$\mathcal{I}^{(1)} = \{j_{k_1}, \ldots, j_l\}$, and $\mathcal{I}^{(1),pe} = \{j_{k_1} - \ell^{pe}/(2M), \ldots, j_l + \ell^{pe}/2\}$.

Next, letting $k_2 < k_1$ be the largest index such that $j_{k_2} - j_{k_2 - 1} > \ell^{pe}/M^2$,
define

$\mathcal{I}^{(2)} = \{j_{k_2}, \ldots, j_{k_1}\}$, $\mathcal{I}^{(2),pe} = \{j_{k_2} - \ell^{pe}/(2M^2), \ldots, j_{k_1} + \ell^{pe}/(2M)\}$.

Continue this process until for some N, $1 \le N \le l$, $k_N = 1$. In this
construction, for each $1 \le k \le N$, if we arrange all the nodes of $\mathcal{I}^{(k),pe}$ in the
ascending order, then the number of nodes in front of $\mathcal{I}^{(k)}$ is significantly
smaller than the number of nodes behind $\mathcal{I}^{(k)}$.

In practice, we introduce a suboptimal but much simpler patching approach
as follows. Fix a component $\mathcal{I} = \{j_1, \ldots, j_l\}$ of $\mathcal{G}^+$. In this approach,
instead of splitting it into smaller sets and patching them separately as in
the previous approach, we patch the whole set $\mathcal{I}$ by

(1.42) $\mathcal{I}^{pe} = \{i:\ j_1 - \ell^{pe}/4 < i < j_l + 3\ell^{pe}/4\}$,

and estimate $\beta^{\mathcal{I}}$ using (1.18). Our numeric studies show that the two approaches
have comparable performances.
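Define

(1.43) $\rho_{cp}^*(\vartheta, r) = \begin{cases} \vartheta + r/4, & r/\vartheta \le 6 + 2\sqrt{10}, \\ 3\vartheta + (r/2 - \vartheta)^2/(2r), & r/\vartheta > 6 + 2\sqrt{10}, \end{cases}$

where 'cp' stands for change-point. Choose the tuning parameters of CASE
such that

(1.44) $\ell^{pe} = 2\log(p)$, $u^{pe} = \sqrt{2\vartheta\log(p)}$, and $v^{pe} = \sqrt{2r\log(p)}$,
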
that (m+ 1)# > p* p (i9,r), and that < q < |(\/2- l) 2 (recall that we take 
t(F,N) = 2qlog(p) for all (F,N) in the change-point setting). Note that 
the choice of $\ell^{pe}$ is different from that in Section 1.5. The main result in this
section is the following theorem which is proved in Section 5.
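The piecewise exponent in (1.43) is easy to evaluate, and two sanity checks are available: the two branches agree at the switching ratio $r/\vartheta = 6 + 2\sqrt{10}$, and setting the exponent equal to 1 recovers the exact-recovery boundary quoted in the caption of Figure 3. The following is a sketch of these checks; the function names are illustrative.

```python
import math

RATIO = 6 + 2 * math.sqrt(10)   # value of r / theta at which the exponent switches form


def rho_cp(theta, r):
    """Exponent for the change-point model, following the two branches of (1.43)."""
    if r / theta <= RATIO:
        return theta + r / 4
    return 3 * theta + (r / 2 - theta) ** 2 / (2 * r)


def exact_recovery_boundary(theta):
    """Value of r solving rho_cp(theta, r) = 1, using the formulas in the Figure 3 caption."""
    r = 4 * (1 - theta)                       # 'right part' of the boundary
    if r / theta <= RATIO:
        return r
    # 'left part' of the boundary, used when theta is small
    return (4 - 10 * theta) + 2 * math.sqrt((2 - 5 * theta) ** 2 - theta ** 2)


boundary_values = {t: exact_recovery_boundary(t) for t in (0.1, 0.24, 0.5, 0.8)}
```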

Theorem 1.4. For the change-point model, the minimax Hamming distance
satisfies $\mathrm{Hamm}_p^*(\vartheta, r, G) = L_p p^{1 - \rho_{cp}^*(\vartheta, r)}$. Furthermore, the CASE $\hat\beta^{case}$
with the tuning parameters specified above satisfies

$\sup_{\mu \in \Theta_p^*(\tau_p, a)} H_p(\hat\beta^{case}; \epsilon_p, \mu, G) \le L_p p^{1 - \rho_{cp}^*(\vartheta, r)} + o(1)$.

Fig 3. Phase diagrams corresponding to the change-point model. Left: CASE; the boundary
is decided by $(4 - 10\vartheta) + 2\sqrt{(2 - 5\vartheta)^2 - \vartheta^2}$ (left part) and $4(1 - \vartheta)$ (right part). Right: hard
thresholding; the upper boundary is decided by $2(1 + \sqrt{1 - \vartheta})^2$ and the lower boundary is
decided by $2\vartheta$.

It is noteworthy that the exponent $\rho_{cp}^*(\vartheta, r)$ has a phase change depending
on the ratio $r/\vartheta$. The insight is, when $r/\vartheta \le 6 + 2\sqrt{10}$, the minimax
Hamming distance is dominated by the Hamming errors we make in
distinguishing between an isolated change point and a pair of adjacent change
points, and when $r/\vartheta > 6 + 2\sqrt{10}$, the minimax Hamming distance is dominated
by the Hamming errors of distinguishing the case of consecutive change
point triplets (say, change points at $\{j - 1, j, j + 1\}$) from the case where we
don't have a change point in the middle of the triplets (that is, the change
points are only at $\{j - 1, j + 1\}$).

Similarly, the main results on the change-point problem can be visualized
with the phase diagram, displayed in Figure 3. An interesting point is that
it is possible to have almost full recovery even when the signal strength
parameter $\tau_p$ is as small as $o(\sqrt{2\log(p)})$. See the proof of Theorem 1.4 for
details.

Alternatively, one may use the following approach to the change-point
problem. Treat the linear change-point model as a regression model
$Y = X\beta + z$ as in Section 1, and let $W = (X'X)^{-1}X'Y$ be the
least-squares estimate. It is seen that

$W \sim N(\beta, \Sigma)$,

where we note that $\Sigma = (X'X)^{-1}$ is tridiagonal and coincides with H.
In this simple setting, a natural approach is to apply a coordinate-wise
thresholding $\hat\beta_j^{thresh} = W_j 1\{|W_j| > t\}$ to locate the signals. But this neglects
the covariance of W in detecting the locations of the signals and is not
optimal even with the ideal choice of thresholding parameter $t_0$, since the
corresponding risk satisfies

$\sup_{\mu \in \Theta_p^*(\tau_p, a)} H_p(\hat\beta^{thresh}(t_0); \epsilon_p, \mu, G) = L_p p^{1 - (r/2 + \vartheta)^2/(2r)}$.

The proof of this is elementary and omitted. The phase diagram of this
method is displayed in Figure 3, right panel, which suggests the method is
non-optimal.
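The two boundaries quoted for hard thresholding in the caption of Figure 3 can be related to a rate exponent. The sketch below assumes that the thresholding risk is $L_p p^{1 - \rho}$ with $\rho = (r/2 + \vartheta)^2/(2r)$ for $r > 2\vartheta$; this exponent is an assumption chosen to be consistent with the stated boundaries $2(1 + \sqrt{1 - \vartheta})^2$ and $2\vartheta$, and the function names are illustrative.

```python
import math


def rho_thresh(theta, r):
    """Assumed rate exponent of coordinate-wise hard thresholding (for r > 2 * theta)."""
    return (r / 2 + theta) ** 2 / (2 * r)


def thresh_exact_recovery_boundary(theta):
    """Upper boundary for hard thresholding, as quoted in the Figure 3 caption."""
    return 2 * (1 + math.sqrt(1 - theta)) ** 2


theta = 0.5
upper = thresh_exact_recovery_boundary(theta)   # about 5.83
lower = 2 * theta                                # lower boundary quoted in the caption
```

At $\vartheta = 0.5$ thresholding needs r of about 5.83 for exact recovery, whereas the boundary $4(1 - \vartheta)$ quoted for CASE gives r = 2, which illustrates the gap between the two panels.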

Other popular methods in locating multiple change-points include the 
global methods (e.g. [16, 25, 29, 33]) and local methods (e.g. SaRa [24]). 
The global methods are usually computationally expensive and can hardly 
be optimal due to the strong correlation nature of this problem. Our pro- 
cedure is related to the local methods but is different in important ways. 
Our method exploits the graphical structures and uses the GOSD to guide 
both the screening and cleaning, but SaRa does not utilize the graphical 
structures and can be shown to be non-optimal. 

1.12. Content. The remaining sections are organized as follows. Section 
2 discusses the key steps for proving Theorem 1.2. Section 3 contains nu- 
meric studies and comparisons with other methods. Section 4 contains sum- 
marizing remarks and discussions. Section 5 contains the proofs for all other 
theorems and lemmas in the paper. 

Throughout this paper, $D = D_{h,\eta}$, d = D(X'Y), B = DG, H = DGD',
and $\mathcal{G}^*$ denotes the GOSD (in contrast, $d_p$ denotes the degree of GOLF
and $H_p$ denotes the Hamming distance). Also, $R$ and $C$ denote the sets
of real numbers and complex numbers respectively, and $R^p$ denotes the
p-dimensional real Euclidean space. Given $0 \le q \le \infty$, for any vector x, $\|x\|_q$
denotes the $L^q$-norm of x; for any matrix M, $\|M\|_q$ denotes the matrix
$L^q$-norm of M. When q = 2, $\|M\|_q$ coincides with the matrix spectral norm; we
shall omit the subscript q in this case. When M is symmetric, $\lambda_{max}(M)$ and
$\lambda_{min}(M)$ denote the maximum and minimum eigenvalues of M respectively.
For two matrices $M_1$ and $M_2$, $M_1 \succeq M_2$ means that $M_1 - M_2$ is positive
semi-definite.

2. Proof of the main theorem. As mentioned before, the success of
CASE relies on two noteworthy properties: the Sure Screening (SS) property
and the Separable After Screening (SAS) property. In this section, we discuss
the two properties in detail, and illustrate how these properties enable us
to decompose the original regression problem into many small-size regression
problems which can be fit separately. We then use these properties to prove
Theorem 1.2.

We start with the SS property. Recall that $\mathcal{U}_p^*$ is the set of all retained
indices at the end of the PS-step. The following lemma is proved in Section
5.

Lemma 2.1 (SS). Under the conditions of Theorem 1.2,

$\sum_{j=1}^p P(\beta_j \ne 0,\ j \notin \mathcal{U}_p^*) \le L_p\Big[p^{1 - (m+1)\vartheta} + \sum_{j=1}^p p^{-\rho_j^*(\vartheta, r, G)}\Big] + o(1)$.

This says that all but a negligible fraction of signals are retained in $\mathcal{U}_p^*$.

At the same time, we have the following lemma, which says that as a
subgraph of $\mathcal{G}^+$, $\mathcal{U}_p^*$ splits into many disconnected components, and each
component has a small size.

Lemma 2.2 (SAS). As $p \to \infty$, under the conditions of Theorem 1.2,
there is a fixed integer $l_0 > 0$ such that with probability at least 1 - o(1/p),
each component of $\mathcal{U}_p^*$ has a size $\le l_0$.

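Together, these two properties enable us to decompose the original
regression problem into many small-size regression problems. To see the point,
let $\mathcal{I}$ be a component of $\mathcal{U}_p^*$, and $\mathcal{I}^{pe}$ be the associated patching set. Recall
that $d \sim N(B\beta, H)$. If we limit our attention to nodes in $\mathcal{I}^{pe}$, then

(2.45) $d^{\mathcal{I}^{pe}} = (B\beta)^{\mathcal{I}^{pe}} + N(0, H^{\mathcal{I}^{pe},\mathcal{I}^{pe}})$.

Denote $V = \{1, \ldots, p\} \setminus \mathcal{U}_p^*$. Write

(2.46) $(B\beta)^{\mathcal{I}^{pe}} = B^{\mathcal{I}^{pe},\mathcal{I}}\beta^{\mathcal{I}} + \xi_1 + \xi_2$,

where

$\xi_1 = \sum_{\mathcal{J}:\ \mathcal{J} \unlhd \mathcal{U}_p^*,\ \mathcal{J} \ne \mathcal{I}} B^{\mathcal{I}^{pe},\mathcal{J}}\beta^{\mathcal{J}}$, $\xi_2 = B^{\mathcal{I}^{pe},V}\beta^{V}$.

Now, first, by the SS property, V contains only a negligible number of signals,
so we expect $\|\xi_2\|_\infty$ to be negligibly small. Second, by the SAS
property, for any $\mathcal{J} \unlhd \mathcal{U}_p^*$ and $\mathcal{J} \ne \mathcal{I}$, nodes in $\mathcal{I}$ and $\mathcal{J}$ are not connected
in $\mathcal{G}^+$. By the way $\mathcal{G}^+$ is defined, it follows that nodes in $\mathcal{I}^{pe}$ and $\mathcal{J}$ are
not connected in the GOSD $\mathcal{G}^*$. Therefore, we expect $\|\xi_1\|_\infty$
to be negligibly small as well. These heuristics are validated in the proof of
Theorem 1.2; see Section 2.1 for details.
As a result,

(2.47) $d^{\mathcal{I}^{pe}} \approx N(B^{\mathcal{I}^{pe},\mathcal{I}}\beta^{\mathcal{I}},\ H^{\mathcal{I}^{pe},\mathcal{I}^{pe}})$,

where the right hand side is a small-size regression model. In other words,
the original regression model decomposes into many small-size regression
models, and each has a similar form to that of (2.47).

We now discuss how to fit Model (2.47). In our model $ARW(\vartheta, r, a, \mu)$,
$\beta^{\mathcal{I}} = b^{\mathcal{I}} \circ \mu^{\mathcal{I}}$, and $P(\|\beta^{\mathcal{I}}\|_0 = k) \sim \epsilon_p^k$. At the same time, given a realization of
$\beta^{\mathcal{I}}$, $d^{\mathcal{I}^{pe}}$ is (approximately) distributed as Gaussian as in (2.47). Combining
these, for any eligible $|\mathcal{I}| \times 1$ vector $\theta$, the log-likelihood for $\beta^{\mathcal{I}} = \theta$ associated
with (2.47) is

(2.48) $-\Big[\frac{1}{2}(d^{\mathcal{I}^{pe}} - B^{\mathcal{I}^{pe},\mathcal{I}}\theta)'(H^{\mathcal{I}^{pe},\mathcal{I}^{pe}})^{-1}(d^{\mathcal{I}^{pe}} - B^{\mathcal{I}^{pe},\mathcal{I}}\theta) + \vartheta\log(p)\|\theta\|_0\Big]$.

Note that $\theta$ is eligible if and only if its nonzero coordinates are $\ge \tau_p$ in magnitude.
Comparing (2.48) with (1.18), if the tuning parameters $(u^{pe}, v^{pe})$ are
set as $u^{pe} = \sqrt{2\vartheta\log(p)}$ and $v^{pe} = \sqrt{2r\log(p)}$, then the PE-step is actually
the MLE constrained in $\Theta_p(\tau_p)$. This explains the optimality of the PE-step.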

The last missing piece of the puzzle is how the information leakage is
patched. Consider the oracle situation first where $\beta^{\mathcal{I}^c}$ is known. In such a
case, by $\tilde{Y} = X'Y \sim N(G\beta, G)$, it is easy to derive that

$\tilde{Y}^{\mathcal{I}} - G^{\mathcal{I},\mathcal{I}^c}\beta^{\mathcal{I}^c} \sim N(G^{\mathcal{I},\mathcal{I}}\beta^{\mathcal{I}},\ G^{\mathcal{I},\mathcal{I}})$.

Comparing this with Model (2.47) and applying Lemma 1.2, we see that
the information leakage associated with the component $\mathcal{I}$ is captured by the
matrix $[U(U'(G^{\mathcal{J}^{pe},\mathcal{J}^{pe}})^{-1}U)^{-1}U']^{\mathcal{I},\mathcal{I}}$, where $\mathcal{J}^{pe} = \{1 \le j \le p:\ D(i, j) \ne 0 \text{ for some } i \in \mathcal{I}^{pe}\}$
and U contains an orthonormal basis of $Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$.
To patch the information leakage, we have to show that this matrix has a
negligible influence. This is justified in the following lemma, which is proved
in Section 5.

Lemma 2.3 (Patching). Under the conditions of Theorem 1.2, for any
$\mathcal{I} \lhd \mathcal{G}^+$ such that $|\mathcal{I}| \le l_0$, and any $|\mathcal{J}^{pe}| \times (|\mathcal{J}^{pe}| - |\mathcal{I}^{pe}|)$ matrix U whose
columns form an orthonormal basis of $Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$,

$\|[U(U'(G^{\mathcal{J}^{pe},\mathcal{J}^{pe}})^{-1}U)^{-1}U']^{\mathcal{I},\mathcal{I}}\| = o(1)$, $p \to \infty$.

We are now ready to prove Theorem 1.2.

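2.1. Proof of Theorem 1.2. For short, write $\hat\beta = \hat\beta^{case}$ and $\rho_j^* = \rho_j^*(\vartheta, r, G)$.
For any $\mu \in \Theta_p^*(\tau_p, a)$, write

$H_p(\hat\beta; \epsilon_p, \mu, G) = I + II$,

where

(2.49) $I = \sum_{j=1}^p P(\beta_j \ne 0,\ j \notin \mathcal{U}_p^*)$, $II = \sum_{j=1}^p P(j \in \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j))$.

Using Lemma 2.1, $I \le L_p[p^{1 - (m+1)\vartheta} + \sum_{j=1}^p p^{-\rho_j^*}] + o(1)$. So it is sufficient
to show

(2.50) $II \le L_p\Big[p^{1 - (m+1)\vartheta} + \sum_{j=1}^p p^{-\rho_j^*}\Big] + o(1)$.

View $\mathcal{U}_p^*$ as a subgraph of $\mathcal{G}^+$. By Lemma 2.2, there is an event $A_p$ and a
fixed integer $l_0$ such that $P(A_p^c) \le o(1/p)$ and that over the event $A_p$, each
component of $\mathcal{U}_p^*$ has a size $\le l_0$. It is seen that

$II \le \sum_{j=1}^p P(j \in \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p) + o(1)$.

Moreover, for each $1 \le j \le p$, there is a unique component $\mathcal{I}$ such that
$j \in \mathcal{I} \unlhd \mathcal{U}_p^*$, and that $|\mathcal{I}| \le l_0$ over the event $A_p$ (note that $\mathcal{I}$ depends on $\mathcal{U}_p^*$
and it is random). Since any realization of $\mathcal{I}$ must be a connected subgraph
(but not necessarily a component) of $\mathcal{G}^+$,

(2.51) $II \le \sum_{j=1}^p \sum_{\mathcal{I}:\ j \in \mathcal{I} \lhd \mathcal{G}^+,\ |\mathcal{I}| \le l_0} P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p) + o(1)$;

see Definition 1.8 for the difference between $\unlhd$ and $\lhd$. We stress that on the
right hand side of (2.51), we have changed the meaning of $\mathcal{I}$ and use it to
denote a fixed (non-random) connected subgraph of $\mathcal{G}^+$.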

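Next, let $\mathcal{E}(\mathcal{I}^{pe})$ be the set of nodes that are connected to $\mathcal{I}^{pe}$ by a length-1
path in $\mathcal{G}^*$:

$\mathcal{E}(\mathcal{I}^{pe}) = \{k:\ \text{there is an edge between } k \text{ and } k' \text{ in } \mathcal{G}^* \text{ for some } k' \in \mathcal{I}^{pe}\}$.

Heuristically, $S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})$ is the set of signals that have major effects on
$d^{\mathcal{I}^{pe}}$. Let $E_{p,\mathcal{I}}$ be the event that $(S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})) \subset \mathcal{I}$ (note that $\mathcal{I}$ is
non-random and the event is defined with respect to the randomness of $\beta$). From
(2.51), we have

(2.52) $II \le \sum_{j=1}^p \sum_{\mathcal{I}:\ j \in \mathcal{I} \lhd \mathcal{G}^+,\ |\mathcal{I}| \le l_0} P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p \cap E_{p,\mathcal{I}}) + rem$,

where it is seen that

(2.53) $rem \le \sum_{j=1}^p \sum_{\mathcal{I}:\ j \in \mathcal{I} \lhd \mathcal{G}^+,\ |\mathcal{I}| \le l_0} P(\mathcal{I} \unlhd \mathcal{U}_p^*,\ A_p \cap E_{p,\mathcal{I}}^c)$.
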
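The following lemma is proved in Section 5.

Lemma 2.4. Under the conditions of Theorem 1.2,

$\sum_{j=1}^p \sum_{\mathcal{I}:\ j \in \mathcal{I} \lhd \mathcal{G}^+,\ |\mathcal{I}| \le l_0} P(\mathcal{I} \unlhd \mathcal{U}_p^*,\ A_p \cap E_{p,\mathcal{I}}^c) \le L_p \sum_{j=1}^p P(\beta_j \ne 0,\ j \notin \mathcal{U}_p^*)$.

Combining (2.53) with Lemma 2.4 and using Lemma 2.1,

(2.54) $rem \le L_p\Big[p^{1 - (m+1)\vartheta} + \sum_{j=1}^p p^{-\rho_j^*}\Big] + o(1)$.

Insert (2.54) into (2.52). To show (2.50), it suffices to show that for each $1 \le j \le p$,

(2.55) $\sum_{\mathcal{I}:\ j \in \mathcal{I} \lhd \mathcal{G}^+,\ |\mathcal{I}| \le l_0} P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p \cap E_{p,\mathcal{I}}) \le L_p p^{-\rho_j^*}$.
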

We now further reduce (2.55) to a simpler form using the sparsity of $\mathcal{G}^+$.
Fix $1 \le j \le p$. The number of subgraphs $\mathcal{I}$ satisfying $j \in \mathcal{I} \unlhd \mathcal{G}^+$ and
$|\mathcal{I}| \le \ell_0$ is no more than $C(eK^+)^{\ell_0}$ [14], where $K^+$ is the maximum
degree of $\mathcal{G}^+$. By Lemma 5.1 and Lemma 5.2 (to be stated in Section 5),
$K^+ \le C(\ell^{pe})^2 K_p$, where $K_p$ is the maximum degree of $\mathcal{G}^*$, which is an $L_p$
term. Therefore, $C(eK^+)^{\ell_0}$ is also an $L_p$ term. In other words, the total
number of terms in the summation of (2.55) is an $L_p$ term. As a result, to
show (2.55), it suffices to show for each fixed $\mathcal{I}$ such that $j \in \mathcal{I} \unlhd \mathcal{G}^+$ and
$|\mathcal{I}| \le \ell_0$,

(2.56)  $P\big(j \in \mathcal{I} \lhd \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p \cap E_{p,\mathcal{I}}\big) \le L_p p^{-\rho_j^*}.$

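Moreover, note that the left hand side of (2.56) is no more than

$\sum_{V_0, V_1 \subset \mathcal{I}:\, j \in V_0 \cup V_1} P\big(\mathrm{Supp}(\beta^{\mathcal{I}}) = V_0,\ \mathrm{Supp}(\hat\beta^{\mathcal{I}}) = V_1,\ \mathcal{I} \lhd \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p \cap E_{p,\mathcal{I}}\big),$

where $V_0$ and $V_1$ are any non-random subsets satisfying the restriction. Since
$|\mathcal{I}| \le \ell_0$, there are only finitely many pairs $(V_0, V_1)$ in the summation. Therefore, to
show (2.56), it is sufficient to show for each fixed triplet $(\mathcal{I}, V_0, V_1)$ satisfying
$\mathcal{I} \unlhd \mathcal{G}^+$, $|\mathcal{I}| \le \ell_0$, $V_0, V_1 \subset \mathcal{I}$ and $j \in V_0 \cup V_1$ that

(2.57)  $P\big(\mathrm{Supp}(\beta^{\mathcal{I}}) = V_0,\ \mathrm{Supp}(\hat\beta^{\mathcal{I}}) = V_1,\ \mathcal{I} \lhd \mathcal{U}_p^*,\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p \cap E_{p,\mathcal{I}}\big) \le L_p p^{-\rho_j^*}.$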

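We now show (2.57). Fix $(\mathcal{I}, V_0, V_1)$, and write $d_1 = d^{\mathcal{I}^{pe}}$, $B_1 = B^{\mathcal{I}^{pe}, \mathcal{I}}$
and $H_1 = H^{\mathcal{I}^{pe}, \mathcal{I}^{pe}}$ for short. Define $\Theta_p(\mathcal{I}, a) = \{\theta^{\mathcal{I}} : \theta_j = 0 \text{ or } \tau_p \le
|\theta_j| \le a\tau_p\}$ and $\Theta_p(\mathcal{I}) = \Theta_p(\mathcal{I}, \infty)$. Since $u^{pe} = \sqrt{2\vartheta \log(p)}$ and $v^{pe} = \tau_p$,
the objective function (1.18) in the PE-step is

$L(\theta) = \frac{1}{2}(d_1 - B_1\theta)' H_1^{-1} (d_1 - B_1\theta) + \vartheta \log(p) \|\theta\|_0.$

Over the event $\{\mathcal{I} \lhd \mathcal{U}_p^*\}$, $\hat\beta^{\mathcal{I}}$ minimizes the objective function, so

$L(\hat\beta^{\mathcal{I}}) \le L(\beta^{\mathcal{I}}).$

As a result, the left hand side of (2.57) is no greater than

(2.58)  $P\big(\mathrm{Supp}(\beta^{\mathcal{I}}) = V_0,\ \mathrm{Supp}(\hat\beta^{\mathcal{I}}) = V_1,\ L(\hat\beta^{\mathcal{I}}) \le L(\beta^{\mathcal{I}}),\ \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j),\ A_p \cap E_{p,\mathcal{I}}\big).$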

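We now calculate (2.58). Write for short $Q_1 = B_1' H_1^{-1} B_1$ and
$\varpi = \tau_p^{-2} (\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}})' Q_1 (\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}})$, and define

$\varpi_j(V_0, V_1; \mathcal{I}) = \tau_p^{-2} \min\, (\beta^{(1)} - \beta^{(0)})' Q_1 (\beta^{(1)} - \beta^{(0)}),$

where the minimum is taken over $(\beta^{(0)}, \beta^{(1)})$ such that $\mathrm{sgn}(\beta_j^{(0)}) \ne \mathrm{sgn}(\beta_j^{(1)})$
and $\beta^{(k)} \in \Theta_p(\mathcal{I})$, $\mathrm{Supp}(\beta^{(k)}) = V_k$, $k = 0, 1$. Introduce

(2.59)  $\rho_j(V_0, V_1; \mathcal{I}) = \max\{|V_0|, |V_1|\}\vartheta + \frac{1}{4}\Big[\Big(\sqrt{\varpi_j r} - \frac{\big||V_1| - |V_0|\big|\vartheta}{\sqrt{\varpi_j r}}\Big)_+\Big]^2.$

Over the event $\{\mathrm{Supp}(\beta^{\mathcal{I}}) = V_0, \mathrm{Supp}(\hat\beta^{\mathcal{I}}) = V_1\}$, $L(\hat\beta^{\mathcal{I}}) \le L(\beta^{\mathcal{I}})$ implies

(2.60)  $(d_1 - B_1\beta^{\mathcal{I}})' H_1^{-1} B_1 (\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}}) \ge \frac{1}{2}(\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}})' B_1' H_1^{-1} B_1 (\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}}) + (|V_1| - |V_0|)\vartheta \log(p).$

With the notation $\varpi$, the right hand side of (2.60) is equal to

(2.61)  $\frac{1}{2}\varpi\tau_p^2 + (|V_1| - |V_0|)\vartheta \log(p).$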
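
For completeness, the step from $L(\hat\beta^{\mathcal{I}}) \le L(\beta^{\mathcal{I}})$ to (2.60) can be spelled out as a short expansion. The display below is a sketch that uses the shorthand $d_1$, $B_1$, $H_1$ from above; the symbol $\Delta = \hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}}$ is introduced here only for this display, and the penalty difference uses $\|\hat\beta^{\mathcal{I}}\|_0 = |V_1|$ and $\|\beta^{\mathcal{I}}\|_0 = |V_0|$ on the event considered.

```latex
\begin{align*}
L(\hat\beta^{\mathcal{I}}) - L(\beta^{\mathcal{I}})
 &= \tfrac{1}{2}\big(d_1 - B_1\beta^{\mathcal{I}} - B_1\Delta\big)' H_1^{-1} \big(d_1 - B_1\beta^{\mathcal{I}} - B_1\Delta\big)
  - \tfrac{1}{2}\big(d_1 - B_1\beta^{\mathcal{I}}\big)' H_1^{-1} \big(d_1 - B_1\beta^{\mathcal{I}}\big) \\
 &\qquad + \big(|V_1| - |V_0|\big)\vartheta\log(p) \\
 &= \tfrac{1}{2}\Delta' B_1' H_1^{-1} B_1 \Delta
  - \big(d_1 - B_1\beta^{\mathcal{I}}\big)' H_1^{-1} B_1 \Delta
  + \big(|V_1| - |V_0|\big)\vartheta\log(p).
\end{align*}
```

Requiring the last line to be at most $0$ and moving the cross term to the other side gives (2.60); the quadratic term equals $\frac{1}{2}\varpi\tau_p^2$ by the definition of $\varpi$, which is (2.61).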

To simplify the left hand side of (2.60), we need the following lemma, which
is proved in Section 5.

Lemma 2.5. For any fixed $\mathcal{I}$ such that $|\mathcal{I}| \le \ell_0$, and any realization of $\beta$
over the event $E_{p,\mathcal{I}}$,

$(B\beta)^{\mathcal{I}^{pe}} = \zeta + B^{\mathcal{I}^{pe}, \mathcal{I}} \beta^{\mathcal{I}},$

for some $\zeta$ satisfying $\|\zeta\| \le C(\ell^{pe})^{1/2} [\log(p)]^{-(1 - 1/\alpha)} \tau_p$.

Using Lemma 2.5, we can write $d_1 - B_1\beta^{\mathcal{I}} = \zeta + H_1^{1/2} z$, where $\zeta$ is as in
Lemma 2.5 and $z \sim N(0, I_{|\mathcal{I}^{pe}|})$. It follows that the left hand side of (2.60)
is equal to

(2.62)  $\zeta' H_1^{-1} B_1 (\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}}) + z' H_1^{-1/2} B_1 (\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}}).$

First, by the Cauchy-Schwarz inequality, the second term in (2.62) is no larger
than $\|z\| \sqrt{\varpi}\, \tau_p$. Second, we argue that the first term in (2.62) is
$o(\|\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}}\| \tau_p)$. To see the point, it suffices to check $\|B_1' H_1^{-1} \zeta\| = o(\tau_p)$. In fact,
note that since $B \in \mathcal{M}_p(\alpha, A_0)$, $\|B_1\| \le \|B\| \le C$; in addition, by RCB,
$\|H_1^{-1}\| \le c_1^{-1} |\mathcal{I}^{pe}|^{\kappa} = O((\ell^{pe})^{\kappa})$. Applying Lemma 2.5 and noticing that
$\ell^{pe} = (\log(p))^{\nu}$ with $\nu < (1 - 1/\alpha)/(\kappa + 1/2)$, we have $\|B_1' H_1^{-1} \zeta\| \le
\|B_1\| \|H_1^{-1}\| \|\zeta\| \le C(\ell^{pe})^{\kappa + 1/2} [\log(p)]^{-(1 - 1/\alpha)} \tau_p$, and the claim follows. Third,
from Lemma 1.2 and Lemma 2.3, $\|G^{\mathcal{I},\mathcal{I}} - Q_1\| = o(1)$ as $p$ grows. So for
sufficiently large $p$, $\lambda_{\min}(Q_1) \ge \frac{1}{2}\lambda_{\min}(G^{\mathcal{I},\mathcal{I}}) \ge C$ for some constant $C > 0$.
It follows from the definition of $\varpi$ that $\sqrt{\varpi}\, \tau_p \ge C\|\hat\beta^{\mathcal{I}} - \beta^{\mathcal{I}}\|$. Combining
these with (2.62), over the event $A_p \cap E_{p,\mathcal{I}}$, the left hand side of (2.60) is no
larger than

(2.63)  $\sqrt{\varpi}\, \tau_p \big(\|z\| + o(\tau_p)\big).$

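Inserting (2.61) and (2.63) into (2.60), we see that over the event $\{\mathrm{Supp}(\beta^{\mathcal{I}}) =
V_0, \mathrm{Supp}(\hat\beta^{\mathcal{I}}) = V_1, L(\hat\beta^{\mathcal{I}}) \le L(\beta^{\mathcal{I}}), A_p \cap E_{p,\mathcal{I}}\}$,

(2.64)  $\|z\| \ge \frac{1}{2}\Big(\sqrt{\varpi r} + \frac{(|V_1| - |V_0|)\vartheta}{\sqrt{\varpi r}}\Big)_+ \sqrt{2\log(p)} + o\big(\sqrt{\log(p)}\big).$

Introduce two functions defined over $(0, \infty)$:
$J_1(x) = |V_0|\vartheta + \frac{1}{4}\big[\big(\sqrt{x} + \frac{(|V_1| - |V_0|)\vartheta}{\sqrt{x}}\big)_+\big]^2$ and
$J_2(x) = \max\{|V_0|, |V_1|\}\vartheta + \frac{1}{4}\big[\big(\sqrt{x} - \frac{||V_1| - |V_0||\vartheta}{\sqrt{x}}\big)_+\big]^2$.
By elementary calculations, $J_1(x) \ge J_2(y)$ for any $x \ge y > 0$. Now, by these
notations, (2.64) can be written equivalently as $\|z\|^2 \ge [J_1(\varpi r) - |V_0|\vartheta] \cdot
2\log(p) + o(\log(p))$, and $\rho_j(V_0, V_1; \mathcal{I})$ defined in (2.59) reduces to $J_2(\varpi_j r)$,
where $\varpi_j = \varpi_j(V_0, V_1; \mathcal{I})$ for short. Moreover, when $\mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j)$,
$\varpi \ge \varpi_j$ by definition, and hence $J_1(\varpi r) \ge J_2(\varpi_j r)$. Combining these,
it follows from (2.64) that over the event $\{\mathrm{Supp}(\beta^{\mathcal{I}}) = V_0, \mathrm{Supp}(\hat\beta^{\mathcal{I}}) =
V_1, L(\hat\beta^{\mathcal{I}}) \le L(\beta^{\mathcal{I}}), \mathrm{sgn}(\hat\beta_j) \ne \mathrm{sgn}(\beta_j), A_p \cap E_{p,\mathcal{I}}\}$,

$\|z\|^2 \ge [\rho_j(V_0, V_1; \mathcal{I}) - |V_0|\vartheta] \cdot 2\log(p) + o(\log(p)),$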

where compared to (2.64), the right hand side is now non-random. It follows
that the probability in (2.58)

(2.65)  $\le P\big(\mathrm{Supp}(\beta^{\mathcal{I}}) = V_0,\ \|z\|^2 \ge [\rho_j(V_0, V_1; \mathcal{I}) - |V_0|\vartheta] \cdot 2\log(p) + o(\log(p))\big).$
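
The "elementary calculations" behind $J_1(x) \ge J_2(y)$ for $x \ge y > 0$ can be sketched as follows. The sketch reads $J_1(x) = |V_0|\vartheta + \frac{1}{4}[(\sqrt{x} + c/\sqrt{x})_+]^2$ and $J_2(x) = \max\{|V_0|,|V_1|\}\vartheta + \frac{1}{4}[(\sqrt{x} - |c|/\sqrt{x})_+]^2$, where the shorthand $c = (|V_1| - |V_0|)\vartheta$ is introduced here only for this display and is not notation from the paper.

```latex
\textbf{Case } c \ge 0.\ \text{Here } \max\{|V_0|,|V_1|\}\vartheta = |V_0|\vartheta + c. \text{ For } y \ge c,
\[
J_2(y) = |V_0|\vartheta + c + \tfrac14\Big(\sqrt{y} - \tfrac{c}{\sqrt{y}}\Big)^2
       = |V_0|\vartheta + \tfrac14\Big(\sqrt{y} + \tfrac{c}{\sqrt{y}}\Big)^2 = J_1(y),
\]
\text{and } t \mapsto \sqrt{t} + c/\sqrt{t} \text{ is nondecreasing on } [c,\infty),
\text{ so } J_1(x) \ge J_1(y) = J_2(y) \text{ for } x \ge y \ge c.
\text{ For } y < c,\ J_2(y) = |V_0|\vartheta + c, \text{ while }
\[
J_1(x) = |V_0|\vartheta + \tfrac14\Big(\sqrt{x} + \tfrac{c}{\sqrt{x}}\Big)^2 \ge |V_0|\vartheta + c
\quad\text{since } (u+v)^2 \ge 4uv .
\]
\textbf{Case } c < 0.\ \text{Here } \max\{|V_0|,|V_1|\}\vartheta = |V_0|\vartheta, \text{ so }
J_1 \text{ and } J_2 \text{ coincide: }
J_1(t) = J_2(t) = |V_0|\vartheta + \tfrac14\Big[\Big(\sqrt{t} - \tfrac{|c|}{\sqrt{t}}\Big)_+\Big]^2,
\text{ which is nondecreasing in } t, \text{ hence } J_1(x) \ge J_2(y) \text{ for } x \ge y > 0 .
```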

Recall that $\beta^{\mathcal{I}} = b^{\mathcal{I}} \circ \mu^{\mathcal{I}}$, where the $b_j$'s are independent Bernoulli variables
with surviving probability $\epsilon_p = p^{-\vartheta}$. It follows that $P(\mathrm{Supp}(\beta^{\mathcal{I}}) = V_0) =
L_p p^{-|V_0|\vartheta}$. Moreover, $\|z\|^2$ is independent of $\beta^{\mathcal{I}}$, and is distributed as $\chi^2$ with
degrees of freedom $|\mathcal{I}^{pe}| \le L_p$. From basic properties of the $\chi^2$-distribution,
$P(\|z\|^2 \ge 2C\log(p) + o(\log(p))) \le L_p p^{-C}$ for any $C > 0$. Combining these,
we find that the term in (2.65)

(2.66)  $\le L_p p^{-|V_0|\vartheta - [\rho_j(V_0, V_1; \mathcal{I}) - |V_0|\vartheta]} = L_p p^{-\rho_j(V_0, V_1; \mathcal{I})}.$

The claim follows by combining (2.66) and the following lemma.

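Lemma 2.6. Under conditions of Theorem 1.2, for any $(j, V_0, V_1, \mathcal{I})$ satisfying
$\mathcal{I} \unlhd \mathcal{G}^+$, $|\mathcal{I}| \le \ell_0$, $V_0, V_1 \subset \mathcal{I}$ and $j \in V_0 \cup V_1$,

$\rho_j(V_0, V_1; \mathcal{I}) \ge \rho_j^*(\vartheta, r, G) + o(1).$

□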

3. Simulations. We conducted a small-scale simulation experiment.
The goal is to investigate how CASE performs with representative parameters.
We focus the study on the change-point model and the long-memory time
series model discussed earlier.

3.1. Change-point model. In this section, we use Model (1.2) to investigate
the performance of CASE in identifying multiple change-points. For
a given set of parameters $(p, \vartheta, r, a)$, we set $\epsilon_p = p^{-\vartheta}$ and $\tau_p = \sqrt{2r\log(p)}$.
First, we generate a $(p - 1) \times 1$ vector $\beta$ by
$\beta_j \overset{iid}{\sim} (1 - \epsilon_p)\nu_0 + \frac{\epsilon_p}{2} U(\tau_p, a\tau_p) + \frac{\epsilon_p}{2} U(-a\tau_p, -\tau_p)$,
where $\nu_0$ is the point mass at $0$ and $U(s, t)$ is the uniform distribution over $[s, t]$ (when
$s = t$, $U(s, t)$ represents the point mass at $s$). Next, we construct the mean
vector $\theta$ in Model (1.2) by $\theta_j = \theta_{j-1} + \beta_{j-1}$, $2 \le j \le p$. Last, we generate
the data vector $Y$ by $Y \sim N(\theta, I_p)$.
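
The data-generating steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation code: the function name `generate_change_point_data`, the starting value $\theta_1 = 0$, and the increment convention $\theta_j = \theta_{j-1} + \beta_{j-1}$ are assumptions made here for concreteness.

```python
import math
import random


def generate_change_point_data(p, vartheta, r, a, rng):
    """Draw (beta, theta, y) from the rare/weak change-point model.

    Hypothetical helper: theta_1 = 0 and theta_j = theta_{j-1} + beta_{j-1}
    are assumptions of this sketch.
    """
    eps_p = p ** (-vartheta)                   # sparsity level epsilon_p
    tau_p = math.sqrt(2.0 * r * math.log(p))   # minimum signal strength tau_p

    beta = []
    for _ in range(p - 1):
        if rng.random() < eps_p:
            # magnitude uniform on [tau_p, a * tau_p]; sign +/- with prob. 1/2
            magnitude = rng.uniform(tau_p, a * tau_p)
            sign = 1.0 if rng.random() < 0.5 else -1.0
            beta.append(sign * magnitude)
        else:
            beta.append(0.0)

    theta = [0.0]                              # assumed starting level
    for b in beta:
        theta.append(theta[-1] + b)            # jump of size beta_{j-1}

    y = [t + rng.gauss(0.0, 1.0) for t in theta]   # Y ~ N(theta, I_p)
    return beta, theta, y
```

With $a = 1$ every nonzero jump has magnitude exactly $\tau_p$, which matches the convention that $U(s, s)$ is a point mass at $s$.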
CASE uses tuning parameters $(\delta, m, \mathcal{Q}, \ell^{pe}, u^{pe}, v^{pe})$. Among these tuning
parameters, $(\delta, m, \mathcal{Q}, \ell^{pe})$ are reasonably flexible. The optimal choice of
$(u^{pe}, v^{pe})$ depends on the unknown parameters $(\epsilon_p, \tau_p)$, and how to estimate
them in general settings is a long-standing open problem (even for linear models
with orthogonal designs). However, we note that, first, Experiment 1.1(a)
shows that if we misspecify $(\epsilon_p, \tau_p)$ by a reasonably small amount and use
the misspecified values to decide the choice of $(u^{pe}, v^{pe})$, then the misspecification
usually has only a negligible effect on the performance of CASE. Second, in
some cases, $(\epsilon_p, \tau_p)$ can be estimated satisfactorily; see Experiment 1.1(b).
For these reasons, in most experiments below, we set the tuning parameters
assuming $(\epsilon_p, \tau_p)$ (or equivalently, $(\vartheta, r)$) as known. To be
fair, when we compare CASE with other methods, we also assume $(\epsilon_p, \tau_p)$
as known when we set the tuning parameters for the latter.

In light of this, we set $m = 3$ when $\vartheta < 0.3$, $m = 2$ when $0.3 \le \vartheta <
0.5$, and $m = 1$ otherwise. In this setting, any $\delta \in (0, 1)$ gives the same
graph $\mathcal{G}^*$, so we take $\delta = 0.5$. Additionally, we set $u^{pe} = \sqrt{2\log(1/\epsilon_p)}$ and
$v^{pe} = \tau_p$. The choice of $\ell^{pe}$ is heuristic and depends on how small $\vartheta$ is;
in our numerical studies, $\ell^{pe}$ ranges from 10 to 35 in the case $p = 5000$ for
different $\vartheta$, and it ranges from 20 to 200 in the case $p = 10^6$. Last, we use
the patching method as described in (1.42) and then apply the PE-step.
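
As a small illustration of these choices, the sketch below computes $(m, \delta, u^{pe}, v^{pe})$ from $(p, \vartheta, r)$ when $(\epsilon_p, \tau_p)$ are treated as known. The function name and the dictionary layout are hypothetical; the heuristic choice of $\ell^{pe}$ and the thresholds $\mathcal{Q}$ are deliberately left out because the text does not give a formula for them here.

```python
import math


def case_tuning_change_point(p, vartheta, r):
    """Oracle-style tuning values for the change-point experiments (sketch)."""
    eps_p = p ** (-vartheta)
    tau_p = math.sqrt(2.0 * r * math.log(p))

    # m = 3 when vartheta < 0.3, m = 2 when 0.3 <= vartheta < 0.5, m = 1 otherwise
    if vartheta < 0.3:
        m = 3
    elif vartheta < 0.5:
        m = 2
    else:
        m = 1

    return {
        "m": m,
        "delta": 0.5,                                    # any value in (0, 1) gives the same graph
        "u_pe": math.sqrt(2.0 * math.log(1.0 / eps_p)),  # equals sqrt(2 * vartheta * log p)
        "v_pe": tau_p,
    }
```

For example, `case_tuning_change_point(5000, 0.6, 1.0)` selects `m = 1`, since $\vartheta \ge 0.5$.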

Experiment 1.1(a). In this experiment, we misspecify $(\epsilon_p, \tau_p)$, say, as
$(\tilde\epsilon_p, \tilde\tau_p)$, set $u^{pe} = \sqrt{2\log(1/\tilde\epsilon_p)}$ and $v^{pe} = \tilde\tau_p$, and investigate how
the misspecification affects the performance of CASE. Fix $(p, \vartheta, \tau_p, a) =
(5000, 0.60, 5, 1)$, so that $(\epsilon_p, \tau_p) = (0.006, 5)$. We misspecify $(\epsilon_p, \tau_p)$ by a
small amount, letting $\tilde\tau_p$ vary in $\{4, 4.5, \cdots, 6\}$ and $\tilde\epsilon_p$ vary in
$\{0.005, 0.0055, \cdots, 0.007\}$. Table 1 reports the average Hamming errors of
50 independent repetitions. The results suggest that CASE is reasonably
insensitive to the misspecification: the performance of CASE when $(\epsilon_p, \tau_p)$
are misspecified is close to that when $(\epsilon_p, \tau_p)$ are assumed as known.
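
The Hamming error reported in the tables counts the coordinates where the sign of the estimate differs from the sign of the truth. A minimal sketch of that count is given below; the helper names `sign` and `hamming_sign_error` are illustrative and not taken from the paper.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def hamming_sign_error(beta_hat, beta):
    """Number of coordinates j with sgn(beta_hat_j) != sgn(beta_j)."""
    if len(beta_hat) != len(beta):
        raise ValueError("beta_hat and beta must have the same length")
    return sum(1 for bh, b in zip(beta_hat, beta) if sign(bh) != sign(b))
```

Averaging this count over independent repetitions gives the kind of summary shown in Table 1; a false positive, a missed signal, and a sign flip each contribute one unit of error.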

For comparison, we also investigate the performance of SaRa (see [24]),
which is defined as

$\hat\beta_i^{SaRa} = W_i \cdot 1\{|W_i| > \lambda\}$, where $W_i = \frac{1}{h}\Big(\sum_{j=i+1}^{i+h} Y_j - \sum_{j=i-h+1}^{i} Y_j\Big).$

SaRa uses two tuning parameters $(h, \lambda)$ which we set ideally assuming $(\epsilon_p, \tau_p)$
as known: for all $(h, \lambda)$ satisfying $h \le \lfloor 1/\epsilon_p \rfloor$ and $\lambda \le \tau_p$, we choose (by
exhaustive numerical search) the pair that yields the smallest Hamming error.
In this setting, the average Hamming error of SaRa is 9.02 (compare Table
1). We see that CASE consistently outperforms SaRa, even when CASE
uses the misspecified $(\tilde\epsilon_p, \tilde\tau_p)$ to determine the tuning parameters $(u^{pe}, v^{pe})$,
Table 1

Hamming errors in Experiment 1.1(a). $p = 5000$, $\vartheta = 0.60$ and $\tau_p = 5$. The expected
number of signals is $p^{1-\vartheta} = 30$.

                             $\tilde\tau_p$
$\tilde\epsilon_p$    4       4.5     5       5.5     6
0.0050                5.50    5.26    5.04    5.08    5.22
0.0055                5.10    5.04    4.84    4.82    5.12
0.0060                5.02    4.82    4.78    4.74    4.98
0.0065                5.06    4.86    4.78    4.76    4.98
0.0070                5.26    4.96    4.84    4.84    5.00


and SaRa uses the true values of $(\epsilon_p, \tau_p)$ to determine the tuning parameters
$(\lambda, h)$.
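
The SaRa statistic used in this comparison is a difference of two adjacent local means followed by hard thresholding. The sketch below computes it with prefix sums. It follows the simplified definition given in the text; setting the estimate to zero at positions where a full window of length $h$ does not fit on both sides is an assumption of this sketch, and the function name is illustrative.

```python
def sara_estimate(y, h, lam):
    """Thresholded local-mean differences W_i * 1{|W_i| > lam} (sketch).

    Position i (1-based) compares the mean of y_{i+1..i+h} with the mean of
    y_{i-h+1..i}. Boundary positions without a full window are left at 0,
    which is an assumption of this sketch.
    """
    p = len(y)
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)       # prefix[k] = y_1 + ... + y_k

    beta_hat = [0.0] * p
    for i in range(h, p - h + 1):               # 1-based positions with full windows
        right = prefix[i + h] - prefix[i]       # sum of y_{i+1}, ..., y_{i+h}
        left = prefix[i] - prefix[i - h]        # sum of y_{i-h+1}, ..., y_i
        w = (right - left) / h
        if abs(w) > lam:
            beta_hat[i - 1] = w
    return beta_hat
```

For a noiseless sequence with a single jump of size 5 after position 10, `sara_estimate([0.0] * 10 + [5.0] * 10, 3, 4.0)` is nonzero only at position 10, where it equals 5.0.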

Experiment 1.1(b). In this experiment, we investigate the performance of
CASE when $(\epsilon_p, \tau_p)$ are unknown but can be estimated. We propose the
following approach to estimate $(\epsilon_p, \tau_p)$:

$\hat\epsilon_p = \frac{1}{p}\sum_{i=1}^{p} 1\{|W_i| > \lambda\}, \qquad \hat\tau_p = \frac{1}{p\hat\epsilon_p}\sum_{i=1}^{p} |W_i| \cdot 1\{|W_i| > \lambda\},$

where $W_i = \frac{1}{h}\big(\sum_{j=i+1}^{i+h} Y_j - \sum_{j=i-h+1}^{i} Y_j\big)$, and $(\lambda, h)$ are tuning parameters.
Our numerical studies find that the approach works satisfactorily, especially
when $\tau_p$ is moderately large and $\epsilon_p$ is moderately small.

Fix $(p, a, \lambda, h) = (5000, 1, 4.5, 5)$. We investigate different settings with
$\vartheta \in \{0.60, 0.45\}$ and $\tau_p \in \{4, \cdots, 9\}$. We compare the performance of CASE
where $(u^{pe}, v^{pe})$ are computed based on $(\hat\epsilon_p, \hat\tau_p)$, CASE where $(u^{pe}, v^{pe})$ are
computed based on $(\epsilon_p, \tau_p)$, and SaRa. Figure 4 summarizes the results based
on 50 independent repetitions. The results suggest that the two versions of
CASE have similar performance, which is substantially better than that of
SaRa.

Experiment 1.2. In this experiment, we compare CASE with the naive
hard thresholding (nHT) introduced in Section 1.11. The tuning parameters
of CASE are set assuming $(\tau_p, \epsilon_p)$ as known. The threshold of nHT
is set ideally as $(r + 2\vartheta)^2/(2r) \cdot \log(p)$ (where we also assume that $(\epsilon_p, \tau_p)$
are known). Fix $p = 10^6$ and $a = 1$. Let $\vartheta$ range in $\{0.35, 0.5, 0.75\}$, and $\tau_p$
range in $\{5, \cdots, 13\}$. Figure 5 summarizes the average Hamming errors of
50 independent repetitions. The results suggest that CASE outperforms the
naive hard thresholding.

Experiment 1.3. In this experiment, we compare the performance of three
procedures, CASE, SaRa and the lasso, with a few representative pairs of
$(\vartheta, \tau_p)$. Note here that the lasso estimate, $\hat\beta^{lasso}$, is the minimizer of the
following functional

$\min_{\beta}\ \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1,$

where $\lambda > 0$ is a tuning parameter. We use the glmnet package [13] in
the simulations.

Fig 4. Hamming errors in Experiment 1.1(b) ($p = 5000$). The x axis is $\tau_p$, and the y axis
is the ratio between the Hamming error and $p$.

Fig 5. Hamming errors in Experiment 1.2 ($p = 10^6$). The x axis is $\tau_p$.

Fix $p = 5000$ and $a = 1$. We let $\vartheta$ range in $\{0.3, 0.45, 0.65\}$ and $\tau_p$ range in
$\{3, 4, \cdots, 10\}$. The tuning parameters of CASE and SaRa are set ideally as in
Experiment 1.1(a), assuming $(\epsilon_p, \tau_p)$ as known. The lasso tuning parameter
$\lambda$ is also set ideally (we calculate the whole solution path and choose the
solution with the smallest Hamming error). Table 2 displays the results based
on 50 independent repetitions, which suggest that CASE outperforms the
other two methods in most cases.

In particular, the lasso behaves unsatisfactorily, due to the strong dependence
among the design variables. Similar conclusions can be drawn in most
of the examples considered in this section, but to save space, we only report
the lasso results here.

Experiment 1.4. In this experiment, we let $a > 1$ so the signals may
Table 2

Hamming errors in Experiment 1.3 ($p = 5000$).

                                                          $\tau_p$
$\vartheta$   $p^{1-\vartheta}$   Method   3       4       5       6       7       8       9       10
0.30          388.4               CASE     212.8   106.9   52.4    25.1    14.8    9.90    7.64    6.66
                                  lasso    375.2   373.3   374.9   371.9   372.6   378.1   369.9   374.0
                                  SaRa     245.3   175.1   106.4   50.4    21.2    6.12    2.98    1.38
0.45          108.3               CASE     56.6    28.9    11.6    4.70    1.68    0.82    0.72    0.64
                                  lasso    105.4   106.1   105.8   102.6   103.1   106.2   103.7   105.1
                                  SaRa     76.0    48.3    31.1    18.3    9.16    3.84    1.76    1.06
0.65          19.7                CASE     11.8    5.94    2.36    0.96    0.38    0.18    0.12    0.14
                                  lasso    19.7    18.3    18.8    19.2    19.1    20.1    20.1    19.4
                                  SaRa     14.5    8.50    5.42    3.94    2.16    1.42    1.06    1.00



Table 3

Hamming errors in Experiment 1.4. $p = 5000$, $\vartheta = 0.5$, $p^{1-\vartheta} = 70.7$ and $\tau_p = 4.5$.

                                      $a$
Sign pattern    Method   1       1.5     2       2.5     3
half-half       CASE     14.26   6.32    5.50    4.78    4.56
                SaRa     24.98   18.96   16.56   14.00   12.50
all-positive    CASE     13.44   6.18    4.90    5.38    4.14
                SaRa     24.26   18.58   16.80   13.66   12.12



have different strengths. Fix $(p, \vartheta, \tau_p) = (5000, 0.50, 4.5)$, and let $a$ range
in $\{1, 1.5, \cdots, 3\}$. We investigate a case where the signals have the "half-positive-half-negative"
sign pattern, i.e., $\beta_j \overset{iid}{\sim} (1 - \epsilon_p)\nu_0 + \frac{\epsilon_p}{2} U(\tau_p, a\tau_p) +
\frac{\epsilon_p}{2} U(-a\tau_p, -\tau_p)$, and a case where the signals have the "all-positive" sign
pattern, i.e., $\beta_j \overset{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p U(\tau_p, a\tau_p)$. We compare CASE with SaRa
for different values of $a$ and sign patterns. The results of 50 independent
repetitions are reported in Table 3, which suggest that CASE uniformly
outperforms SaRa for various values of $a$ and the two sign patterns.

3.2. Long-memory time series model. In this section, we consider the
long-memory time series model with a specific $f$ as in (1.39) and (1.40).
Fix $(p, \phi, \vartheta, \tau_p, a)$, where $\phi$ is the long-memory parameter. We first use $f$ to
compute $G$ and let $X = G^{1/2}$. We then generate the vector $\beta$ by $\beta_j \overset{iid}{\sim} (1 -
\epsilon_p)\nu_0 + \frac{\epsilon_p}{2} U(\tau_p, a\tau_p) + \frac{\epsilon_p}{2} U(-a\tau_p, -\tau_p)$. Finally, we generate $Y \sim N(X\beta, I_p)$.

CASE uses tuning parameters $(m, \delta, \ell^{ps}, \mathcal{Q}, \ell^{pe}, u^{pe}, v^{pe})$. In the experiments
below, we choose them as follows: $m = 2$, $\delta = 0.35$, $u^{pe} = \sqrt{2\vartheta\log(p)}$ and
$v^{pe} = \sqrt{2r\log(p)}$. We take $t(F, N) = q^*(F, N)\log(p)$, where $q^*(F, N)$ is
Table 4

Hamming errors in Experiment 2.1 ($\phi = 0.35$, $p = 5000$).
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12.0 


1.42 
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1.12 
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CASE 
lasso 


12.2 
11.4 


6.56 
6.24 


2.48 
2.74 


0.76 
1.14 


0.38 
0.36 


0.22 
0.18 


0.06 
0.08 


0.12 
0.00 



defined in (1.31), and $5 \le \ell^{ps} \le 10$, depending on how large $\vartheta$ is (small $\vartheta$
corresponds to large $\ell^{ps}$). We choose $\ell^{pe}$ as follows: over a certain range of
integers, run CASE for each value and choose the largest integer such that
each component of $\mathcal{U}_p^*$ (as a subgraph of $\mathcal{G}^+$) has a size $\le 10$. In general, a larger
$\ell^{pe}$ gives better performance, but may result in longer computation time.

Experiment 2.1. Fix $p = 5000$, $\phi = 0.35$ and $a = 1$. Let $\vartheta$ range in
$\{0.35, 0.5, 0.65\}$, and $\tau_p$ range in $\{4, \cdots, 11\}$. We compare the performance
of CASE with that of the lasso. The tuning parameters of CASE are set
as above. The tuning parameters of the lasso are the oracle ones as in Experiment
1.3. The results based on 50 independent repetitions are summarized
in Table 4. We see that CASE uniformly outperforms the lasso when
$\vartheta = 0.35, 0.5$. When $\vartheta = 0.65$, the performances of the two methods are
similar.

Experiment 2.2. In this experiment, we force the signals to appear in
adjacent pairs or triplets. Fix $p = 5000$, $\phi = 0.35$, $\vartheta = 0.75$ and let $\tau_p$
range in $\{5, \cdots, 10\}$. We use '+−' to denote the signal pattern 'pairs of
opposite signs', and '++' to denote 'pairs of the same sign'. Other signal patterns are
denoted similarly. To generate $\beta$ corresponding to '+−', we first generate a
$(p/2) \times 1$ vector $\theta$ by $\theta_j \overset{iid}{\sim} (1 - \epsilon_p)\nu_0 + \frac{\epsilon_p}{2} U(\tau_p, a\tau_p) + \frac{\epsilon_p}{2} U(-a\tau_p, -\tau_p)$, then
let $\beta_{2j-1} = \theta_j$ and $\beta_{2j} = -\theta_j$. Similarly for other signal patterns. Figure 6
displays the results of 50 independent repetitions. We see that in the four 

patterns 'H — ', '+H — ', £ H h' and 'H ', CASE uniformly outperforms 

the lasso when $\tau_p \ge 6$.
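
The paired-signal construction in Experiment 2.2 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and the `same_sign` flag are hypothetical, the opposite-sign case is read as $\beta_{2j-1} = \theta_j$, $\beta_{2j} = -\theta_j$, and only the two pair patterns ('+−' and '++') are covered.

```python
import math
import random


def generate_paired_signals(p, vartheta, r, a, same_sign, rng):
    """Generate beta whose signals appear in adjacent pairs (sketch).

    same_sign=False gives the '+-' pattern (beta_{2j} = -beta_{2j-1});
    same_sign=True gives the '++' pattern. Both readings are assumptions.
    """
    eps_p = p ** (-vartheta)
    tau_p = math.sqrt(2.0 * r * math.log(p))

    beta = [0.0] * p
    for j in range(p // 2):
        if rng.random() < eps_p:
            magnitude = rng.uniform(tau_p, a * tau_p)
            theta_j = magnitude if rng.random() < 0.5 else -magnitude
            beta[2 * j] = theta_j                                   # beta_{2j-1} in 1-based indexing
            beta[2 * j + 1] = theta_j if same_sign else -theta_j    # beta_{2j}
    return beta
```

Because the pairs share one underlying draw $\theta_j$, the expected number of nonzero coordinates is about $p\,\epsilon_p$, the same as in the unpaired model.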

4. Discussion. Variable selection when the Gram matrix $G$ is non-sparse
is a challenging problem. We approach this problem by first sparsifying
$G$ with a finite order linear filter, and then constructing a sparse graph
GOSD. The key insight is that, in the post-filtering data, the true signals
live in many small-size components that are disconnected in GOSD, but
we do not know where. We propose CASE as a new approach to variable
selection. This is a two-stage Screen and Clean method, where we first use
a covariance-assisted multivariate screening to identify candidates for such
small-size components, and then re-examine each candidate with penalized
least squares. In both stages, to overcome the problem of information leakage,
we employ a delicate patching technique.

Fig 6. Hamming errors in Experiment 2.2.

We develop an asymptotic framework focusing on the regime where the 
signals are rare and weak so that successful variable selection is challeng- 
ing but is still possible. We show that CASE achieves the optimal rate of 
convergence in Hamming distance across a wide class of situations where G 
is non-sparse but sparsifiable. Such optimality cannot be achieved by many 
popular methods, including but not limited to the lasso, SCAD, and Dantzig 
selector. When G is non-sparse, these methods are not expected to behave 
well even when the signals are strong. We have successfully applied CASE to
two different applications: the change-point problem and the long-memory
time series.

Compared to the well-known method of marginal screening [10, 32], CASE 
employs a covariance-assisted multivariate screening procedure, so that it is 
theoretically more effective than marginal screening, with only a moderate 
increase in the computational complexity. CASE is closely related to the 
graphical lasso [12, 22], which also attempts to exploit the graph structure. 
However, the setting considered here is very different from that in [12, 22] 
and our emphasis on optimality is also very different. 

The paper is closely related to the recent work [20] (see also [19]), but is
different in important ways. The work in [20] is motivated by the recent literature
on Compressive Sensing and Genetic Regulatory Networks, and is largely
focused on the case where the Gram matrix $G$ is sparse in an unstructured
fashion. The current work is motivated by the recent interest in DNA copy
number variation and long-memory time series, and is focused on the case
where there is strong dependence between different design variables so $G$
is usually non-sparse and sometimes ill-posed. To deal with the strong dependence,
we have to use a finite-order linear filter and delicate patching
techniques. Additionally, the current paper also studies applications to the
long-memory time series and the change-point problem, which have not been
considered in [20]. In particular, the study of the change-point problem
involves very different and very delicate analysis in the derivation
of both the lower bound and the upper bound, which we have not seen before in the
literature. For these reasons, the two papers have very different scopes and
techniques, and the results in one paper cannot be deduced from those in
the other.

The main results in this paper can be extended to much broader settings. 
For example, we have used a Rare and Weak signal model where the signals 
are randomly generated from a two-component mixture. The main results 
continue to hold if we choose to use a much more relaxed model, as long as 
the signals live in small-size isolated islands in the post-filtering data. 

In this paper, we have focused on the change-point model and the long-memory
time series model, where the post-filtering matrices have polynomial
off-diagonal decay and are sparse in a structured fashion. CASE can be
extended to more general settings, where the sparsity of the post-filtering
matrices is unstructured, provided that we modify the patching technique
accordingly: the patching set can be constructed by including nodes which
are connected to the original set through a short-length path in the GOSD
$\mathcal{G}^*$.

Another extension is the case where the Gram matrix can be sparsified by an operator
$D$, but $D$ is not necessarily a linear filter. To apply CASE to this
setting, we need to design a specific patching technique. For example, when
$D^{-1}$ is sparse, for a given $\mathcal{I}$, we can construct $\mathcal{I}^{pe} = \{j : |D^{-1}(i, j)| >
\delta_1, \text{ for some } i \in \mathcal{I}\}$, where $\delta_1$ is a chosen threshold.

The paper is closely related to recent literature on DNA copy number 
variation and financial data analysis, but is different in focus and scope. It 
is of interest to further investigate such connections. To save space, we leave 
explorations along this line to the future. 






5. Proofs. This section is organized as follows. In Section 5.1, we state 
and prove three preliminary lemmas, which are useful for this section. In 
Sections 5.2-5.12, we give the proofs of all the main theorems and lemmas 
stated in the preceding sections. 

5.1. Preliminary lemmas. We introduce Lemmas 5.1-5.3, where Lemmas 
5.1-5.2 are proved below, and Lemma 5.3 is proved in [20, Lemma 1.4]. 

Recall that $B = DG$ and $\mathcal{G}^*$ is the GOSD in Definition 1.3 with $\delta =
1/\log(p)$. Introduce the matrix $B^{**}$ by

$B^{**}(i, j) = B(i, j) \cdot 1\{j \in E(\{i\})\}, \qquad 1 \le i, j \le p,$

where for any set $V \subset \{1, \cdots, p\}$,

$E(V) = \{k : \text{there is an edge between } k \text{ and } k' \text{ in } \mathcal{G}^* \text{ for some } k' \in V\}.$

Recall that $\mathcal{M}_p(\alpha, A_0)$ is the class of matrices defined in (1.7).

Lemma 5.1. When $B \in \mathcal{M}_p(\alpha, A_0)$, $\mathcal{G}^*$ is $K_p$-sparse for $K_p \le C[\log(p)]^{1/\alpha}$
and $\|B - B^{**}\|_\infty \le C[\log(p)]^{-(1 - 1/\alpha)}$.

Proof. Consider the first claim. Since $B \in \mathcal{M}_p(\alpha, A_0)$ and $H(i, j) =
\sum_{k=0}^{h} \eta_k B(i, j + k)$, there exists a constant $A_0' > 0$ such that $H \in \mathcal{M}_p(\alpha, A_0')$.
Let $K_p$ be the smallest integer satisfying

$K_p \ge 2[\max(A_0, A_0')\log(p)]^{1/\alpha},$

where it is seen that $K_p \le C(\log(p))^{1/\alpha}$. At the same time, for any $i, j$ such
that $|i - j| + 1 > K_p/2$, we have $|B(i, j)| < \delta$, $|B(j, i)| < \delta$ and $|H(i, j)| < \delta$.
By definition, there is no edge between nodes $i$ and $j$ in $\mathcal{G}^*$. This proves that
$\mathcal{G}^*$ is $K_p$-sparse, and the claim follows.

Consider the second claim. When $|B(i, j)| > \delta$, there is an edge between
nodes $i$ and $j$ in $\mathcal{G}^*$, and it follows that $(B - B^{**})(i, j) = 0$. Therefore, for
any $1 \le i \le p$,

$\sum_{j=1}^{p} |(B - B^{**})(i, j)| \le \sum_{j:\, |j - i| + 1 > K_p/2} |B(i, j)| + \sum_{j:\, |j - i| + 1 \le K_p/2,\ |B(i, j)| \le \delta} |B(i, j)| \equiv I + II,$

where $I \le 2A_0 \sum_{k + 1 > K_p/2} k^{-\alpha} \le CK_p^{1 - \alpha}$ and $II \le K_p\delta \le CK_p^{1 - \alpha}$. Recalling
the choice of $K_p$, $\|B - B^{**}\|_\infty \le CK_p^{1 - \alpha} \le C[\log(p)]^{-(1 - 1/\alpha)}$, and the
claim follows. □
Next, recall that $\mathcal{G}^+$ is an expanded graph of $\mathcal{G}^*$, given in Definition 1.7,
and $\mathcal{I} \lhd \mathcal{G}$ denotes that $\mathcal{I}$ is a component of $\mathcal{G}$, as in Definition 1.8.

Lemma 5.2. When $\mathcal{G}^*$ is $K$-sparse, $\mathcal{G}^+$ is $K(2\ell^{pe} + 1)^2$-sparse. In addition,
for any set $V \subset \{1, \cdots, p\}$, let $\mathcal{G}_V^+$ be the subgraph of $\mathcal{G}^+$ formed by
nodes in $V$. Then for any $\mathcal{I} \lhd \mathcal{G}_V^+$, $(V \setminus \mathcal{I}) \cap E(\mathcal{I}^{pe}) = \emptyset$.

Proof. Consider the first claim. It suffices to show that for any fixed
$1 \le i \le p$, there are at most $K(2\ell^{pe} + 1)^2$ different nodes $j$ such that there is
an edge between $i$ and $j$ in $\mathcal{G}^+$. Towards this end, note that $\{i\}^{pe}$ contains
no more than $(2\ell^{pe} + 1)$ nodes. Since $\mathcal{G}^*$ is $K$-sparse, for each $k \in \{i\}^{pe}$,
there are no more than $K$ nodes $k'$ such that there is an edge between $k$ and
$k'$ in $\mathcal{G}^*$. Again, for each such $k'$, there are no more than $(2\ell^{pe} + 1)$ nodes $j$
such that $k' \in \{j\}^{pe}$. Combining these gives the claim.

Consider the second claim. Fix $V$ and $\mathcal{I} \lhd \mathcal{G}_V^+$. Since $\mathcal{I}$ is a component,
for any $i \in \mathcal{I}$ and $j \in V \setminus \mathcal{I}$, there is no edge between $i$ and $j$ in $\mathcal{G}_V^+$. By
definition, this implies $\{j\}^{pe} \cap E(\{i\}^{pe}) = \emptyset$, and in particular $j \notin E(\{i\}^{pe})$.
Since this holds for all such $i$ and $j$, using that $E(\mathcal{I}^{pe}) = \cup_{i \in \mathcal{I}} E(\{i\}^{pe})$, we
have $(V \setminus \mathcal{I}) \cap E(\mathcal{I}^{pe}) = \emptyset$, and the claim follows. □

Finally, recall the definition of $\rho_j^*(\vartheta, r, a, G)$ in (1.27) and that of $\psi(F, N)$
in (1.33).

Lemma 5.3. When $a > a^*(G)$, $\rho_j^*(\vartheta, r, a, G)$ does not depend on $a$ and
$\rho_j^*(\vartheta, r, a, G) = \rho_j^*(\vartheta, r, G) = \min_{(F, N):\, j \in F,\ F \cap N = \emptyset,\ F \ne \emptyset} \psi(F, N)$.

5.2. Proof of Lemma 1.2. For preparation, note that the Fisher Information
Matrix associated with model (1.13) is

$Q \equiv (B^{\mathcal{I}^+, \mathcal{I}})'(H^{\mathcal{I}^+, \mathcal{I}^+})^{-1}(B^{\mathcal{I}^+, \mathcal{I}}).$

Write $D_1 = D^{\mathcal{I}^+, \mathcal{J}^+}$ and $G_1 = G^{\mathcal{J}^+, \mathcal{J}^+}$ for short. It follows that $B^{\mathcal{I}^+, \mathcal{J}^+} =
D_1G_1$ and $H^{\mathcal{I}^+, \mathcal{I}^+} = D_1G_1D_1'$. Let $\mathcal{T}$ be the mapping from $\mathcal{J}^+$ to $\{1, \cdots, |\mathcal{J}^+|\}$
that maps each $j \in \mathcal{J}^+$ to its order in $\mathcal{J}^+$, and let $\mathcal{I}_1 = \mathcal{T}(\mathcal{I})$. By these
notations, we can write

(5.67)  $Q = Q_1^{\mathcal{I}_1, \mathcal{I}_1}$, where $Q_1 = G_1D_1'(D_1G_1D_1')^{-1}D_1G_1$.

Comparing (5.67) with the desired claim, it suffices to show

(5.68)  $Q_1 = G_1 - U(U'G_1^{-1}U)^{-1}U'.$

Let $R = D_1G_1^{1/2}$ and $P_R = R'(RR')^{-1}R$. It is seen that

(5.69)  $Q_1 = G_1^{1/2}P_RG_1^{1/2} = G_1 - G_1^{1/2}(I - P_R)G_1^{1/2}.$

Now, we study the matrix $I - P_R$. Let $k = |\mathcal{J}^+|$, and denote by $S(R)$ the row
space of $R$ and by $\mathcal{N}(R)$ the orthogonal complement of $S(R)$ in $\mathbb{R}^k$. By construction,
$P_R$ is the orthogonal projection matrix from $\mathbb{R}^k$ to $S(R)$. Hence,
$I - P_R$ is the orthogonal projection matrix from $\mathbb{R}^k$ to $\mathcal{N}(R)$. By definition,
$\mathcal{N}(R) = \{\eta \in \mathbb{R}^k : R\eta = 0\}$. Recall that $R = D_1G_1^{1/2}$. Therefore, $R\eta = 0$ if
and only if there exists $\xi \in \mathbb{R}^k$ such that $\eta = G_1^{-1/2}\xi$ and $D_1\xi = 0$. At the
same time, $Null(\mathcal{I}^+, \mathcal{J}^+) = \{\xi \in \mathbb{R}^k : D_1\xi = 0\}$. Combining these, we have

(5.70)  $\mathcal{N}(R) = \{G_1^{-1/2}\xi : \xi \in Null(\mathcal{I}^+, \mathcal{J}^+)\}.$

Introduce a new matrix $V = G_1^{-1/2}U$. Since the columns of $U$ form an orthonormal
basis of $Null(\mathcal{I}^+, \mathcal{J}^+)$, it follows from (5.70) that the columns of
$V$ form a basis (but not necessarily an orthonormal basis) of $\mathcal{N}(R)$. Consequently,

(5.71)  $I - P_R = V(V'V)^{-1}V' = G_1^{-1/2}U(U'G_1^{-1}U)^{-1}U'G_1^{-1/2}.$

Plugging (5.71) into (5.69) gives (5.68). □

5.3. Proof of Lemma 1.4. Write $\rho_j^* = \rho_j^*(\vartheta, r, G)$ for short. It suffices to
show for any $\log(p) \le j \le p - \log(p)$, there exists $(V_0, V_1)$ such that

(5.72)  $\rho(V_0, V_1) \le \rho_j^* + o(1), \qquad j \in (V_0 \cup V_1) \subset \{j + i : -\log(p) \le i \le \log(p)\}.$

In fact, once (5.72) is proved, then $d_p(\mathcal{G}^{\diamond}) \le 2\log(p) + 1$, and the claim
follows directly.

We now construct $(V_0, V_1)$ to satisfy (5.72) for any $j$ such that $\log(p) \le
j \le p - \log(p)$. The key is to construct a sequence of set pairs $(V_0^{(t)}, V_1^{(t)})$
recursively as follows. Let $V_0^{(1)} = V_{0j}^*$ and $V_1^{(1)} = V_{1j}^*$, where $(V_{0j}^*, V_{1j}^*)$ are as
defined in Section 1.8. For any integer $t \ge 1$, we update $(V_0^{(t)}, V_1^{(t)})$ as follows.
If all inter-distances between the nodes in $V_0^{(t)} \cup V_1^{(t)}$ (assuming all nodes are
sorted ascendingly) do not exceed $\log(p)/g$, then the process terminates.
Otherwise, there is a pair of adjacent nodes $i_1$ and $i_2$ in $V_0^{(t)} \cup V_1^{(t)}$ (again,
assuming the nodes are sorted ascendingly) such that $i_2 > i_1 + \log(p)/g$. In
our construction, it is not hard to see that $j \in V_0^{(t)} \cup V_1^{(t)}$. Therefore, we
have either the case of $j \le i_1$ or the case of $j \ge i_2$. In the first case, we let

$N^{(t+1)} = N^{(t)} \cap \{i : i \le i_1\}, \qquad F^{(t+1)} = F^{(t)} \cap \{i : i \le i_1\};$

and in the second case, we let

$N^{(t+1)} = N^{(t)} \cap \{i : i \ge i_2\}, \qquad F^{(t+1)} = F^{(t)} \cap \{i : i \ge i_2\},$

where $N^{(t)} = V_0^{(t)} \cap V_1^{(t)}$ and $F^{(t)} = (V_0^{(t)} \cup V_1^{(t)}) \setminus N^{(t)}$. We then update by
defining

$V_0^{(t+1)} = N^{(t+1)} \cup F', \qquad V_1^{(t+1)} = N^{(t+1)} \cup F'',$

where $(F', F'')$ are constructed as follows: Write $F^{(t+1)} = \{j_1, j_2, \cdots, j_k\}$ where
$j_1 < j_2 < \cdots < j_k$ and $k = |F^{(t+1)}|$. When $k$ is even, let $F' = \{j_1, \cdots, j_{k/2}\}$
and $F'' = F^{(t+1)} \setminus F'$; otherwise, let $F' = \{j_1, \cdots, j_{(k-1)/2}\}$ and $F'' = F^{(t+1)} \setminus F'$.

Now, first, by the construction, $|F^{(t)} \cup N^{(t)}|$ is strictly decreasing in $t$.
Second, by [20, Lemma 1.2], $|F^{(t)} \cup N^{(t)}| \le |V_{0j}^* \cup V_{1j}^*| \le g$. As a result, the
recursive process above terminates in finitely many rounds. Let $T$ be the number of
rounds when the process terminates; we construct $(V_0, V_1)$ by

(5.73)  $V_0 = V_0^{(T)}, \qquad V_1 = V_1^{(T)}.$

Next, we justify that $(V_0, V_1)$ constructed in (5.73) satisfies (5.72). First, it is
easy to see that $j \in V_0 \cup V_1$ and $|V_0 \cup V_1| \le g$. Second, all pairs of adjacent
nodes in $V_0 \cup V_1$ have an inter-distance $\le \log(p)/g$ (assuming all nodes are
sorted), so $(V_0 \cup V_1) \subset \{j - \log(p), \cdots, j + \log(p)\}$. As a result, all that remains
to show is

(5.74)  $\rho(V_0, V_1) \le \rho_j^* + o(1).$

By a similar argument as in [20, Lemma 1.4] and the definitions (i.e., (1.33) and
[20, (1.23)]), if $a > a^*(G)$, then for any $(V_0', V_1')$ such that $|V_0' \cup V_1'| \le g$, we
have $\rho(V_0', V_1') \ge \psi(F', N')$, where $N' = V_0' \cap V_1'$ and $F' = (V_0' \cup V_1') \setminus N'$.
Moreover, the equality holds when $|V_0'| = |V_1'|$ in the case $|F'|$ is even, and
$|V_0'| - |V_1'| = \pm 1$ in the case $|F'|$ is odd. Combining these with the definitions,

$\rho(V_0, V_1) = \psi(F^{(T)}, N^{(T)}), \qquad \rho_j^* = \rho(V_{0j}^*, V_{1j}^*) = \rho(V_0^{(1)}, V_1^{(1)}) \ge \psi(F^{(1)}, N^{(1)}).$

Recall that $T$ is a finite number. So to show (5.74), it suffices to show for
each $1 \le t \le T - 1$,

(5.75)  $\psi(F^{(t+1)}, N^{(t+1)}) \le \psi(F^{(t)}, N^{(t)}) + o(1).$

Fixing $1 \le t \le T - 1$, write for short $F = F^{(t)}$, $N = N^{(t)}$, $N_1 = N^{(t+1)}$
and $F_1 = F^{(t+1)}$. Let $\mathcal{I} = F \cup N$ and $\mathcal{I}_1 = F_1 \cup N_1$. With these notations,
(5.75) reduces to

(5.76)  $\psi(F_1, N_1) \le \psi(F, N) + o(1).$
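By the way $\psi$ is defined (i.e., (1.33)), it is sufficient to show

(5.77)  $\omega(F_1, N_1) \le \omega(F, N) + o(1).$

In fact, once (5.77) is proved, (5.75) follows by noting that $|F_1| + 2|N_1| \le
|F| + 2|N| - 1$.

We now show (5.77). Letting $\Omega = \mathrm{diag}(G^{\mathcal{I}_1, \mathcal{I}_1}, G^{\mathcal{I} \setminus \mathcal{I}_1, \mathcal{I} \setminus \mathcal{I}_1})$,

(5.78)
$\omega(F, N) = \min_{\theta \in \mathbb{R}^{|\mathcal{I}|}:\, |\theta_i| \ge 1,\, \forall i \in F} \theta' G^{\mathcal{I}, \mathcal{I}} \theta$
$\qquad \ge \min_{\theta \in \mathbb{R}^{|\mathcal{I}|}:\, |\theta_i| \ge 1,\, \forall i \in F} \theta' \Omega \theta - \max_{\theta \in \mathbb{R}^{|\mathcal{I}|}:\, |\theta_i| \le 2a,\, \forall i} |\theta'(G^{\mathcal{I}, \mathcal{I}} - \Omega)\theta|$
$\qquad \ge \min_{\theta \in \mathbb{R}^{|\mathcal{I}_1|}:\, |\theta_i| \ge 1,\, \forall i \in F_1} \theta' G^{\mathcal{I}_1, \mathcal{I}_1} \theta - \max_{\theta \in \mathbb{R}^{|\mathcal{I}|}:\, |\theta_i| \le 2a,\, \forall i} |\theta'(G^{\mathcal{I}, \mathcal{I}} - \Omega)\theta|$
$\qquad = \omega(F_1, N_1) - \max_{\theta \in \mathbb{R}^{|\mathcal{I}|}:\, |\theta_i| \le 2a,\, \forall i} |\theta'(G^{\mathcal{I}, \mathcal{I}} - \Omega)\theta|,$

where in the first and last equalities we use equivalent forms of $\omega(F, N)$, in
the second inequality we use the fact that the constraints $|\theta_i| \ge 1$ can be
replaced by $1 \le |\theta_i| \le a$ for any $a > a^*$ and the triangle inequality, and
in the third inequality we use the definition of $\Omega$.

Finally, note that for any $k \in \mathcal{I}_1$ and $k' \in \mathcal{I} \setminus \mathcal{I}_1$, $|k - k'| > \log(p)/g$
holds. In addition, $G$ has polynomial off-diagonal decay with rate $\gamma > 0$.
Together we find that $\|G^{\mathcal{I}, \mathcal{I}} - \Omega\| \le C(\log(p)/g)^{-\gamma} = o(1)$. As a result,
$\max_{\theta \in \mathbb{R}^{|\mathcal{I}|}:\, |\theta_i| \le 2a,\, \forall i} |\theta'(G^{\mathcal{I}, \mathcal{I}} - \Omega)\theta| \le Ca^2 \cdot \|G^{\mathcal{I}, \mathcal{I}} - \Omega\| \cdot |\mathcal{I}| = o(1)$. Inserting
this into (5.78) gives (5.77). □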

5.4. Proof of Theorem 1.3. First, we define $\rho^*_{lts}(\vartheta, r; f)$ as follows. For 
any spectral density function $f$, let $G^\infty = G^\infty(f)$ be the (infinite-dimensional) 
Toeplitz matrix generated by $f$: $G^\infty(i,j) = \hat{f}(|i - j|)$ for any $i, j \in \mathbb{Z}$, 
where $\hat{f}(k)$ is the $k$-th Fourier coefficient of $f$. In the definition of $\rho(V_0, V_1)$ 
in (1.25)-(1.26), replace $G$ by $G^\infty$ and call the new term $\rho^\infty(V_0, V_1)$. For any 
fixed $j$, let 

(5.79) $\rho^*_{j,lts}(\vartheta, r; f) = \min_{(V_0,V_1): j \in V_0 \cup V_1} \rho^\infty(V_0, V_1)$, 

where $V_0, V_1$ are subsets of $\mathbb{Z}$. Due to the definition of Toeplitz matrices, 
$\rho^*_{j,lts}(\vartheta, r; f)$ does not depend on $j$, so we write it as $\rho^*_{lts}(\vartheta, r; f)$ for short. 
By (5.72), it is seen that 

(5.80) $\rho_j^*(\vartheta, r, G) = \rho^*_{lts}(\vartheta, r; f) + o(1)$, for any $\log(p) \leq j \leq p - \log(p)$. 

Now, to show the claim, it is sufficient to check the main conditions of 
Theorem 1.2. In detail, it suffices to check that 

(a) $G \in \mathcal{M}^*_p(\gamma, g, c_0, A_1)$ with $\gamma = 1 - 2\phi > 0$, $A_1 > 0$ and $c_0 > 0$. 

(b) $B \in \mathcal{M}_p(\alpha, A_0)$ with $\alpha = 2 - 2\phi > 1$ and $A_0 > 0$. 

(c) Conditions RCA and RCB hold with $\kappa = 2 - 2\phi > 0$ and $c_1 > 0$. 

To show these claims, we need some lemmas and results in elementary 
calculus. In detail, first, we have 

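(5.81) $|f'(\omega)| \leq C|\omega|^{-(2\phi+1)}$, $|f''(\omega)| \leq C|\omega|^{-(2\phi+2)}$. 

For a proof of (5.81), we rewrite $f(\omega) = f^*(\omega)/|2\sin(\omega/2)|^{2\phi}$, where by 
assumption $f^*(\omega)$ is a continuous function that is twice differentiable except 
at 0, and $|(f^*)''(\omega)| \leq C|\omega|^{-2}$. It can be derived from basic properties in 
analysis that 

(5.82) $|(f^*)''(\omega)| \leq C|\omega|^{-2}$, $|(f^*)'(\omega)| \leq C|\omega|^{-1}$, and $|f^*(\omega)| \leq C$. 
At the same time, by elementary calculation, 

$|f'(\omega)| \leq C|\omega|^{-(2\phi+1)}\big(|f^*(\omega)| + |\omega (f^*)'(\omega)|\big)$, 

$|f''(\omega)| \leq C|\omega|^{-(2\phi+2)}\big(|f^*(\omega)| + |\omega (f^*)'(\omega)| + |\omega^2 (f^*)''(\omega)|\big)$, 

and (5.81) follows by plugging in (5.82). 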

Second, we need the following lemma, whose proof is a simple exercise of 
analysis and omitted. 

Lemma 5.4. Suppose $g$ is a symmetric real function which is differentiable 
in $[-\pi, 0) \cup (0, \pi]$ and $|g'(\omega)| \leq C|\omega|^{-\alpha}$ for some $\alpha \in (1,2)$. Then as 
$x \to \infty$, $\int_{-\pi}^{\pi} \cos(\omega x) g(\omega)\,d\omega = O(|x|^{-(2-\alpha)})$. 

We now show (a)-(c). Consider (a) first. First, by (5.81) and Lemma 5.4, 
$|\int_{-\pi}^{\pi} \cos(k\omega) f(\omega)\,d\omega| \leq C k^{-(1-2\phi)}$ for large $k$, so that $|G(i,j)| \leq C(1 + |i - j|)^{-(1-2\phi)}$. 
Second, by well-known results on Toeplitz matrices, $\lambda_{\min}(G) \geq 
\min_{\omega \in [-\pi,\pi]} f(\omega) > 0$. Combining these, (a) holds with $\gamma = 1 - 2\phi$ and 
$c_0 = \min_{\omega \in [-\pi,\pi]} f(\omega)$. 

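Next, we consider (b). Recall that $B = DG$ where $D$ is the first-order row-differencing 
matrix. So $B(i,j) = \frac{1}{2\pi}\int_{-\pi}^{\pi} [\cos(k\omega) - \cos((k+1)\omega)] f(\omega)\,d\omega$, 
where $k = i - j$. Without loss of generality, we only consider the case $k \geq 1$. 
Denote $g(\omega) = \omega f(\omega)$. By Fubini's theorem and integration by parts, 

$B(i,j) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Big[\int_k^{k+1} \omega \sin(\omega x)\,dx\Big] f(\omega)\,d\omega$ 
$= \frac{1}{2\pi}\int_k^{k+1} \Big[\int_{-\pi}^{\pi} g(\omega)\sin(\omega x)\,d\omega\Big] dx$ 
$= \frac{1}{2\pi}\int_k^{k+1} \Big[-\frac{2 g(\pi)\cos(\pi x)}{x} + \frac{1}{x}\int_{-\pi}^{\pi} \cos(\omega x) g'(\omega)\,d\omega\Big] dx$ 
$= -\frac{g(\pi)}{\pi}\int_k^{k+1} \frac{\cos(\pi x)}{x}\,dx + \frac{1}{2\pi}\int_k^{k+1} \frac{1}{x}\Big[\int_{-\pi}^{\pi} \cos(\omega x) g'(\omega)\,d\omega\Big] dx \equiv I_1 + I_2$. 

First, using integration by parts, $|I_1| = \big|\frac{g(\pi)}{\pi}\int_k^{k+1} \frac{\cos(\pi x)}{x}\,dx\big| = O(k^{-2})$. 

Second, similar to (5.81), we derive that $g''(\omega) = O(|\omega|^{-(1+2\phi)})$. Applying 
Lemma 5.4 to $g'$, we have $|\int_{-\pi}^{\pi} \cos(\omega x) g'(\omega)\,d\omega| \leq C|x|^{-(1-2\phi)}$, and so $|I_2| \leq 
\int_k^{k+1} C x^{-(2-2\phi)}\,dx = O(k^{-(2-2\phi)})$. Combining these gives $|B(i,j)| \leq C(1 + 
|i - j|)^{-(2-2\phi)}$, and (b) holds with $\alpha = 2 - 2\phi$. 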
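
The decay rates in (a) and (b) can be sanity-checked numerically in the simplest instance of the assumed form, $f^*(\omega) \equiv 1$, i.e. the FARIMA(0, $\phi$, 0) density $f(\omega) = |2\sin(\omega/2)|^{-2\phi}$. The sketch below is an illustration only and is not part of the proof: the choice $\phi = 0.3$, the helper names `farima_acov` and `decay_exponent`, and the use of the standard closed-form FARIMA autocovariance are all assumptions made here.

```python
import math

def farima_acov(k, phi):
    """G(i, j) for |i - j| = k when f(w) = |2 sin(w/2)|^(-2 phi), 0 < phi < 1/2.

    Uses the standard closed form
        Gamma(1 - 2 phi) Gamma(k + phi) / (Gamma(phi) Gamma(1 - phi) Gamma(k + 1 - phi)),
    evaluated on the log scale for numerical stability.
    """
    k = abs(k)
    log_val = (math.lgamma(1 - 2 * phi) + math.lgamma(k + phi)
               - math.lgamma(phi) - math.lgamma(1 - phi)
               - math.lgamma(k + 1 - phi))
    return math.exp(log_val)

def decay_exponent(seq, k):
    """Estimate e in |seq(k)| ~ C * k**e from the values at k and 2k."""
    return math.log(abs(seq(2 * k)) / abs(seq(k))) / math.log(2)

phi = 0.3                                  # hypothetical long-memory parameter
G = lambda k: farima_acov(k, phi)          # entries of G along the k-th diagonal
B = lambda k: G(k) - G(k + 1)              # entries of B = DG (first-order differencing)

exp_G = decay_exponent(G, 1000)            # (a) predicts -(1 - 2 phi)
exp_B = decay_exponent(B, 1000)            # (b) predicts -(2 - 2 phi)
print(exp_G, exp_B)
```

Under these assumptions the two estimated exponents should be close to $-(1-2\phi)$ and $-(2-2\phi)$, matching $\gamma$ in (a) and $\alpha$ in (b).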

Last, we show (c). Since $\varphi_\eta(z) = 1 - z$, RCA holds trivially, and all that 
remains is to check that RCB holds. Recall that $H = DGD'$, where $D$ is 
the first-order row-differencing matrix. The goal is to show there exists a 
constant $c_1 > 0$ such that for any triplet $(k, b, V)$, 

(5.83) $b' H^{V,V} b \geq c_1 k^{-(2-2\phi)} \|b\|^2$, 

where $1 \leq k \leq p$ is an integer, $b \in \mathbb{R}^k$ is a vector, and $V \subset \{1, 2, \ldots, p\}$ is a 
subset with $|V| = k$. 

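Towards this end, we introduce $f_1(\omega) = 4\sin^2(\omega/2) f(\omega)$, where we recall 
that $f$ is the spectral density associated with $G$. Fixing a triplet $(k, b, V)$, we 
write $b = (b_1, b_2, \ldots, b_k)'$ and $V = \{j_1, \ldots, j_k\}$ such that $j_1 < j_2 < \cdots < j_k$. 
By definitions and basic algebra, 

$H(i,j) = G(i,j) - G(i+1,j) - G(i,j+1) + G(i+1,j+1)$ 
$= \frac{1}{2\pi}\int_{-\pi}^{\pi} [2\cos(k\omega) - \cos((k+1)\omega) - \cos((k-1)\omega)] f(\omega)\,d\omega$ 
$= \frac{1}{2\pi}\int_{-\pi}^{\pi} \cos(k\omega) f_1(\omega)\,d\omega$, 

where for short $k = i - j$, which, together with direct calculations, implies that 

$b' H^{V,V} b = \frac{1}{2\pi}\int_{-\pi}^{\pi} \sum_{s=1}^{k}\sum_{t=1}^{k} b_s b_t \cos((j_s - j_t)\omega) f_1(\omega)\,d\omega = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Big|\sum_{s=1}^{k} b_s e^{\sqrt{-1} j_s \omega}\Big|^2 f_1(\omega)\,d\omega$. 

At the same time, note that $f_1(\omega) \geq C|\omega|^{2-2\phi}$ for any $|\omega| \leq \pi$. 

Combining these with symmetry and monotonicity gives 

(5.84) $b' H^{V,V} b \geq \frac{C}{\pi}\int_0^{\pi} \Big|\sum_{s=1}^{k} b_s e^{\sqrt{-1} j_s \omega}\Big|^2 \omega^{2-2\phi}\,d\omega \geq \frac{C}{\pi}\int_{\pi/(2k)}^{\pi} \Big|\sum_{s=1}^{k} b_s e^{\sqrt{-1} j_s \omega}\Big|^2 \omega^{2-2\phi}\,d\omega$. 

Next, we write 

(5.85) $\frac{1}{\pi}\int_0^{\pi} \Big|\sum_{s=1}^{k} b_s e^{\sqrt{-1} j_s \omega}\Big|^2 d\omega = I + II$, 

where $I$ and $II$ are the integration in the interval of $[0, \pi/(2k)]$ and $[\pi/(2k), \pi]$, 
respectively. By (5.84) and the monotonicity of the function $\omega^{2-2\phi}$ in $[\pi/(2k), \pi]$, 

(5.86) $b' H^{V,V} b \geq C k^{-(2-2\phi)} \cdot \frac{1}{\pi}\int_{\pi/(2k)}^{\pi} \Big|\sum_{s=1}^{k} b_s e^{\sqrt{-1} j_s \omega}\Big|^2 d\omega = C k^{-(2-2\phi)} \cdot II$. 

At the same time, by the Cauchy-Schwarz inequality, $|\sum_{s=1}^{k} b_s e^{\sqrt{-1} j_s \omega}|^2 \leq 
(\sum_{s=1}^{k} |e^{\sqrt{-1} j_s \omega}|^2)(\sum_{s=1}^{k} |b_s|^2) = k\|b\|^2$, and so $I \leq \frac{1}{\pi}\int_0^{\pi/(2k)} k\|b\|^2\,d\omega \leq 
\|b\|^2/2$. Inserting this into (5.85), whose left hand side equals $\|b\|^2$ because 
$j_1 < \cdots < j_k$ are distinct integers, gives 

(5.87) $II \geq \|b\|^2 - \|b\|^2/2 = \|b\|^2/2$, 

and (5.83) follows by combining (5.86) and (5.87). □ 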
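
The split (5.85)-(5.87) can be checked numerically on a small example. The sketch below uses a midpoint rule to approximate the two pieces $I$ and $II$; the vector `b`, the node set `nodes` and the helper `band_energy` are hypothetical choices made only for illustration.

```python
import cmath
import math

def band_energy(b, nodes, lo, hi, steps=20000):
    """Midpoint-rule value of (1/pi) * int_lo^hi |sum_s b_s exp(i j_s w)|^2 dw."""
    h = (hi - lo) / steps
    total = 0.0
    for n in range(steps):
        w = lo + (n + 0.5) * h
        s = sum(bs * cmath.exp(1j * js * w) for bs, js in zip(b, nodes))
        total += abs(s) ** 2
    return total * h / math.pi

b = [1.0, -2.0, 0.5, 3.0]      # hypothetical coefficient vector
nodes = [2, 3, 7, 11]          # hypothetical V = {j_1 < ... < j_k}, distinct integers
k = len(b)
norm2 = sum(x * x for x in b)  # ||b||^2

part_I = band_energy(b, nodes, 0.0, math.pi / (2 * k))       # integral over [0, pi/(2k)]
part_II = band_energy(b, nodes, math.pi / (2 * k), math.pi)  # integral over [pi/(2k), pi]
print(part_I, part_II, norm2)
```

In this example $I + II$ reproduces $\|b\|^2$ up to quadrature error, $I$ stays below $\|b\|^2/2$, and therefore $II$ stays above $\|b\|^2/2$, as (5.87) states.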

5.5. Proof of Lemma 1.5. First, we show $r^*_{lts}(\vartheta) = r^*_{lts}(\vartheta; f)$ is a decreasing 
function of $\vartheta$. Similarly to the proof of Theorem 1.3, in the definition 
of $\omega(F,N)$ and $\psi(F,N)$ (recall (1.33) and (1.34)), replace $G$ by $G^\infty$, and 
denote the new terms by $\omega^\infty(F,N) = \omega^\infty(F,N;\vartheta,r,f)$ and $\psi^\infty(F,N) = 
\psi^\infty(F,N;\vartheta,r,f)$, respectively. By similar argument in Lemma 5.3, 

$\rho^*_{lts}(\vartheta, r; f) = \min_{(F,N): F \cap N = \emptyset, F \neq \emptyset} \psi^\infty(F,N)$. 

For each pair of sets $(F,N)$ and $\vartheta \in (0,1)$ let $r^*(\vartheta; F, N) = r^*(\vartheta; F, N, f)$ be 
the minimum $r$ such that $\psi^\infty(F,N;\vartheta,r,f) \geq 1$. It follows that 

$r^*_{lts}(\vartheta) = \max_{(F,N): F \cap N = \emptyset, F \neq \emptyset} r^*(\vartheta; F, N)$. 

It is easy to see that $r^*(\vartheta; F, N)$ is a decreasing function of $\vartheta$ for each fixed 
$(F,N)$. So $r^*_{lts}(\vartheta)$ is also a decreasing function of $\vartheta$. 

Next, we consider $\lim_{\vartheta \to 1} r^*_{lts}(\vartheta)$. In the special case of $F = \{j\}$ and 
$N = \emptyset$, $\omega^\infty(F,N) = 1$, $\lim_{\vartheta \to 1} r^*(\vartheta; F, N) = 1$, and so $\liminf_{\vartheta \to 1} r^*_{lts}(\vartheta) \geq 1$. 
At the same time, for any $(F,N)$ such that $|F| + |N| > 1$, $\psi^\infty(F,N) \geq \vartheta$ and 
so $\lim_{\vartheta \to 1} r^*(\vartheta; F, N) \leq 1$. Hence, $\limsup_{\vartheta \to 1} r^*_{lts}(\vartheta) \leq 1$. Combining these 

gives the claim. 

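Last, we consider $\lim_{\vartheta \to 0} r^*_{lts}(\vartheta)$. First, since $\lim_{\vartheta \to 0} \psi^\infty(F,N) = \omega^\infty(F,N) r/4$ 

for any fixed $(F,N)$, we have 

(5.88) $\lim_{\vartheta \to 0} r^*_{lts}(\vartheta) = 4\Big[\min_{(F,N): F \cap N = \emptyset, F \neq \emptyset} \omega^\infty(F,N)\Big]^{-1}$. 

Second, by definitions, 

(5.89) $\min_{(F,N): F \cap N = \emptyset, F \neq \emptyset} \omega^\infty(F,N) = \lim_{p \to \infty} \min_{(F,N): (F \cup N) \subset \{1,\ldots,p\}, F \cap N = \emptyset, F \neq \emptyset} \omega(F,N)$, 

whenever the limit on the right hand side exists. 

Third, note that (a) Given $F$, $\omega(F,N)$ decreases as $N$ increases and (b) 
Given $F \cup N$, $\omega(F,N)$ decreases as $N$ increases (the proofs are straightforward 
and we omit them). As a result, for all $(F,N)$ such that $(F \cup N) \subset 
\{1, \ldots, p\}$, $\omega(F,N)$ is minimized at $F = \{j\}$ and $N = \{1, \ldots, p\}\setminus\{j\}$ for 
some $j$, with the minimal value equaling the reciprocal of the $j$-th diagonal 
of $G^{-1}$. In other words, 

(5.90) $\lim_{p \to \infty} \min_{(F,N): (F \cup N) \subset \{1,\ldots,p\}, F \cap N = \emptyset, F \neq \emptyset} \omega(F,N) = \Big[\lim_{p \to \infty} \max_{1 \leq j \leq p} G^{-1}(j,j)\Big]^{-1}$. 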

Fourth, if we write G = G p to emphasize on the size of G, then by basic 
algebra and the Toeplitz structure of G, we have (G~ 1 )(j,j) < (G~^_ k )(j + 
k,j + k) for all 1 < k < P ~ j and (G p l ){j,j) < (Gj£ k )(j - k,j - k) for 
1 < k < j — 1. Especially, if we take k = log(p), then it follows that 

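(5.91) $\lim_{p \to \infty} \max_{1 \leq j \leq p} G^{-1}(j,j) = \lim_{p \to \infty} \max_{\log(p) \leq j \leq p - \log(p)} G^{-1}(j,j)$. 

Last, we have the following lemma which is proved in Appendix A. 

Lemma 5.5. Under conditions of Lemma 1.5, 

$\lim_{p \to \infty} \max_{\log(p) \leq j \leq p - \log(p)} G^{-1}(j,j) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f^{-1}(\omega)\,d\omega$. 

Combining (5.88)-(5.91) and using Lemma 5.5, 

$\lim_{\vartheta \to 0} r^*_{lts}(\vartheta) = 4 \cdot \Big[\lim_{p \to \infty} \max_{\log(p) \leq j \leq p - \log(p)} G^{-1}(j,j)\Big] = \frac{2}{\pi}\int_{-\pi}^{\pi} f^{-1}(\omega)\,d\omega$. 

□ 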
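
The identity in Lemma 5.5 can be illustrated on a Toeplitz matrix whose inverse is known in closed form. The sketch below uses $G(i,j) = \rho^{|i-j|}$ with spectral density $f(\omega) = (1-\rho^2)/(1 - 2\rho\cos\omega + \rho^2)$. This short-memory example is an assumption chosen purely for convenience and is not the long-memory setting of Lemma 1.5; the values of `rho` and `p` and the helper `solve` are likewise hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for cc in range(c, n + 1):
                M[r][cc] -= factor * M[c][cc]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][cc] * x[cc] for cc in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

rho, p = 0.5, 25
G = [[rho ** abs(i - j) for j in range(p)] for i in range(p)]

# Interior diagonal entry of G^{-1}: solve G x = e_j and read off x_j.
j = p // 2
e_j = [1.0 if i == j else 0.0 for i in range(p)]
inv_diag = solve(G, e_j)[j]

def f(w):
    """Spectral density whose Fourier coefficients are rho**|k|."""
    return (1 - rho ** 2) / (1 - 2 * rho * math.cos(w) + rho ** 2)

# (1 / 2 pi) * integral of 1/f over [-pi, pi], midpoint rule.
steps = 20000
h = 2 * math.pi / steps
integral = sum(1.0 / f(-math.pi + (n + 0.5) * h) for n in range(steps)) * h / (2 * math.pi)
print(inv_diag, integral)
```

Both quantities equal $(1+\rho^2)/(1-\rho^2)$ in this example, so the limit in the last display would be four times that value.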
5.6. Proof of Theorem 1.4. Write for short $\hat\beta = \hat\beta^{case}$ and $\rho^*_{cp} = \rho^*_{cp}(\vartheta, r)$. 
It suffices to show 

(5.92) $\mathrm{Hamm}^*_p(\vartheta, r, G) \geq L_p p^{1-\rho^*_{cp}}$; 
and for any $\mu \in \Theta^*_p(\tau_p, a)$, 

(5.93) $H_p(\hat\beta; \epsilon_p, \mu, G) \equiv \sum_{j=1}^{p} P(\mathrm{sgn}(\hat\beta_j) \neq \mathrm{sgn}(\beta_j)) \leq L_p p^{1-\rho^*_{cp}} + o(1)$. 

First, we show (5.92). The statement is similar to that of Theorem 1.1, 
but $d_p(\mathcal{G}^{\diamond}) \leq L_p$ does not hold. Therefore, we introduce a different graph 
$\mathcal{G}^{\nabla}$ as follows: Define a counterpart of $\rho_j^*(\vartheta, r, G)$ as 

(5.94) $\tilde\rho_j^*(\vartheta, r, G) = \min_{(V_0,V_1): \min(V_0 \cup V_1) = j} \rho(V_0, V_1)$, 

where $\min(V_0 \cup V_1) = j$ means $j$ is the smallest node in $V_0 \cup V_1$. Let $(V^*_{0j}, V^*_{1j})$ 
be the minimizer of (5.94), and when there is a tie, pick the one that appears 
first lexicographically. Define the graph $\mathcal{G}^{\nabla}$ with nodes $\{1, \ldots, p\}$, and that 
there is an edge between nodes $j$ and $k$ whenever $(V^*_{0j} \cup V^*_{1j}) \cap (V^*_{0k} \cup V^*_{1k}) \neq \emptyset$. 

Denote $d_p(\mathcal{G}^{\nabla})$ the maximum degree of nodes in $\mathcal{G}^{\nabla}$. Similar to Theorem 
1.1, as $p \to \infty$, 

(5.95) $\mathrm{Hamm}^*_p(\vartheta, r, G) \geq L_p [d_p(\mathcal{G}^{\nabla})]^{-1} \sum_{j=1}^{p} p^{-\tilde\rho_j^*(\vartheta, r, G)}$. 

The proof is a trivial extension of [20, Theorem 1.1] and we omit it. Moreover, 
the following lemma is proved below. 

Lemma 5.6. As $p \to \infty$, $\max_{\log(p) \leq j \leq p - \log(p)} |\tilde\rho_j^*(\vartheta, r, G) - \rho^*_{cp}(\vartheta, r)| = 
o(1)$, and $d_p(\mathcal{G}^{\nabla}) \leq L_p$. 

Combining (5.95) with Lemma 5.6 gives (5.92). 

Second, we show (5.93). The change-point model is an 'extreme' case and 
Theorem 1.2 does not apply directly. However, once we justify the following 
claims (a)-(c), (5.93) follows by similar arguments in Theorem 1.2. 

(a) SS property: 

$\sum_{j=1}^{p} P(\beta_j \neq 0, j \notin \mathcal{U}_p^*) \leq L_p p^{1-\rho^*_{cp}} + o(1)$. 

(b) SAS property: If we view $\mathcal{U}_p^*$ as a subgraph of $\mathcal{G}^+$, there is a fixed 
integer $l_0 > 0$ such that with probability at least $1 - o(1/p)$, each 
component of $\mathcal{U}_p^*$ has a size $\leq l_0$. 

(c) A counterpart of Lemma 2.6: For any $\log(p) \leq j \leq p - \log(p)$, and 
fixed $\mathcal{I} \unlhd \mathcal{G}^+$ such that $j \in \mathcal{I}$ and $|\mathcal{I}| \leq l_0$, suppose we construct 
$\{\mathcal{I}^{(k)}, \mathcal{I}^{(k),pe}\}$ using the process introduced in the PE-step, 
and $j \in \mathcal{I}^{(k)}$. Then for any pair of sets $(V_0, V_1)$ such that $\mathcal{I}^{(k)} = V_0 \cup V_1$, 
$\rho_j(V_0, V_1; \mathcal{I}^{(k)}) \geq \rho^*_{cp} + o(1)$, 
where $\rho_j(V_0, V_1; \mathcal{I}^{(k)})$ is defined in (2.59). 

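Consider (a) first. Following the proof of Lemma 2.1 until (5.119), we find 
that for each $\log(p) \leq j \leq p - \log(p)$, 

$P(\beta_j \neq 0, j \notin \mathcal{U}_p^*) \leq L_p \sum_{(\mathcal{I},F,N): j \in \mathcal{I} \unlhd \mathcal{G}^*, |\mathcal{I}| \leq m, F \cup N = \mathcal{I}, F \cap N = \emptyset, F \neq \emptyset} p^{-|\mathcal{I}|\vartheta - [(\sqrt{\omega_0 r} - \sqrt{q})_+]^2} + L_p p^{-(m+1)\vartheta} + o(1/p)$, 

where $\omega_0 = \tau_p^{-2} (\beta^F)'[Q^{F,F} - Q^{F,N}(Q^{N,N})^{-1}Q^{N,F}]\beta^F$ and $Q$ is defined as 
in (1.15). First, by the choice of $m$, $L_p p^{-(m+1)\vartheta} \leq L_p p^{-\rho^*_{cp}}$. Second, using 
similar arguments in Lemma 2.1, the summation contains at most $L_p$ terms. 
Third, by (1.35), $\omega_0 \geq \omega(F,N)$. Combining the above, it suffices to show for 
each triplet $(\mathcal{I}, F, N)$ in the summation, 

(5.96) $|\mathcal{I}|\vartheta + [(\sqrt{\omega(F,N) r} - \sqrt{q})_+]^2 \geq \rho^*_{cp}$. 
The key to (5.96) is to show 

(5.97) $\omega(F,N) \geq 1/2$. 
Once (5.97) is proved, since $q \leq \frac{r}{4}(\sqrt{2} - 1)^2$, 

$|\mathcal{I}|\vartheta + [(\sqrt{\omega(F,N) r} - \sqrt{q})_+]^2 \geq |\mathcal{I}|\vartheta + r/4 \geq \rho^*_{cp}$, 

where in the last inequality we use the facts $\rho^*_{cp} \leq \vartheta + r/4$ and $|\mathcal{I}| \geq 1$. This 
gives (5.96). 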

All that remains is to show (5.97). We argue that it suffices to consider those 
$(\mathcal{I}, F, N)$ where both $\mathcal{I}$ $(= F \cup N)$ and $F$ are formed by consecutive nodes. 
First, since $G$ is tri-diagonal, the definition of $\mathcal{G}^*$ implies that any $\mathcal{I} \unlhd \mathcal{G}^*$ is 
formed by consecutive nodes. Second, by (1.35) and basic algebra, 

(5.98) $\omega(F,N) = \min_{\xi \in \mathbb{R}^{|F|}: |\xi_i| \geq 1} \xi'[(Q^{-1})^{F,F}]^{-1}\xi$, 

where $Q$ is defined in (1.15). Note that $B$ is an identity matrix and $\mathcal{I}^{ps} = \mathcal{I}$. 
So $Q^{-1} = H^{\mathcal{I},\mathcal{I}}$, which is a tri-diagonal matrix. It follows from (5.98) that 
if $F$ is not formed by consecutive nodes, there exist $F_1 \subset F$ and $N_1 = \mathcal{I}\setminus F_1$ 
such that $\omega(F_1, N_1) \leq \omega(F,N)$. The argument then follows. 

From now on, we focus on $(\mathcal{I}, F, N)$ such that both $\mathcal{I}$ and $F$ are formed 
by consecutive nodes. Elementary calculation yields 

(5.99) $[(Q^{-1})^{F,F}]^{-1} = (H^{F,F})^{-1} = \Omega^{(k)} - \frac{1}{k+1}\eta\eta'$, 

where $k = |F|$, $\Omega^{(k)}$ is the $k \times k$ matrix defined by $\Omega^{(k)}(i,j) = \min\{i,j\}$ 
and $\eta = (1, \ldots, k)'$. We see that $\omega(F,N)$ only depends on $k$. When $k = 1$, 
$\omega(F,N) = 1/2$ by direct calculations following (5.98) and (5.99). When 
$k \geq 2$, from (5.98) and (5.99), 

$\omega(F,N) = \min_{\xi \in \mathbb{R}^{k}: |\xi_i| \geq 1} \Big[\sum_{l=1}^{k} (\xi_l + \cdots + \xi_k)^2 - \frac{1}{k+1}(\xi_1 + 2\xi_2 + \cdots + k\xi_k)^2\Big]$. 

Let $s_l = \sum_{j=l}^{k} \xi_j$. The above right hand side is lower bounded by $\sum_{l=1}^{k} s_l^2 - 
(\sum_{l=1}^{k} s_l)^2/k = \sum_{l < l'} (s_l - s_{l'})^2/k$, where $\sum_{l < l'} (s_l - s_{l'})^2 \geq \sum_{l=1}^{k-1} (s_{l+1} - 
s_l)^2 \geq k - 1$. Therefore, 

$\omega(F,N) \geq (k-1)/k \geq 1/2$. 

This proves (5.97). 
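
As a sanity check of (5.99) and the bound (5.97), the following sketch verifies in exact rational arithmetic that $\Omega^{(k)} - \eta\eta'/(k+1)$ inverts the $k \times k$ tri-diagonal block (2 on the diagonal, $-1$ next to it), and then evaluates the quadratic form at random points with $|\xi_i| \geq 1$. It assumes the interior block $H^{F,F}$ has exactly that tri-diagonal form; all helper names and the sampling scheme are hypothetical.

```python
from fractions import Fraction
import random

def tridiagonal_block(k):
    """Assumed form of H^{F,F} for consecutive F: 2 on the diagonal, -1 beside it."""
    return [[Fraction(2 if i == j else -1 if abs(i - j) == 1 else 0)
             for j in range(k)] for i in range(k)]

def claimed_inverse(k):
    """Omega^{(k)} - eta eta' / (k + 1), with Omega^{(k)}(i, j) = min(i, j)."""
    return [[Fraction(min(i, j)) - Fraction(i * j, k + 1)
             for j in range(1, k + 1)] for i in range(1, k + 1)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def is_identity(M):
    n = len(M)
    return all(M[i][j] == (1 if i == j else 0) for i in range(n) for j in range(n))

def quad_form(M, x):
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

# (5.99): the claimed matrix really is the inverse, for k = 1, ..., 7.
inverse_ok = all(is_identity(matmul(tridiagonal_block(k), claimed_inverse(k)))
                 for k in range(1, 8))

# (5.97): xi' M xi >= (k - 1) / k at random points with |xi_i| >= 1.
rng = random.Random(0)
bound_ok = True
for k in range(2, 7):
    M = claimed_inverse(k)
    for _ in range(300):
        xi = [rng.choice((-1, 1)) * Fraction(rng.randint(10, 30), 10) for _ in range(k)]
        if quad_form(M, xi) < Fraction(k - 1, k):
            bound_ok = False
print(inverse_ok, bound_ok, claimed_inverse(1)[0][0])
```

Random sampling only probes the inequality and does not compute the minimum in (5.98); the case $k = 1$ gives the value $1/2$ directly.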

Next, consider (b). We check RCB, and the remaining proof is exactly the 
same as in Lemma 2.2. Towards this end, the goal is to show there exists a 
constant $c_1 > 0$ such that for any $(k, V)$ where $V \subset \{1, \ldots, p\}$ and $k = |V|$, 

(5.100) $\lambda_{\min}(H^{V,V}) \geq c_1 k^{-2}$. 

Since $H$ is tri-diagonal, it suffices to show that (5.100) holds when $V$ is 
formed by consecutive nodes, i.e., $V = \{j, \ldots, j+k\}$ for some $1 \leq j \leq p - k$. 
In this case, we introduce a matrix $\Sigma^{(k)}$, which is 'smaller' than $H^{V,V}$ but 
much easier to analyse: 

$\Sigma^{(k)}(i,j) = 2 \cdot 1\{i = j\} - 1\{|i - j| = 1\} - 1\{i = j = k\}$, $1 \leq i, j \leq k$. 
It is easy to see that $H^{V,V} - \Sigma^{(k)}$ is positive semi-definite. Hence, 

(5.101) $\lambda_{\min}(H^{V,V}) \geq \lambda_{\min}(\Sigma^{(k)})$. 

Observing that $(\Sigma^{(k)})^{-1} = \Omega^{(k)}$, where $\Omega^{(k)}$ is as in (5.99), we have 

(5.102) $\lambda_{\min}(\Sigma^{(k)}) = [\lambda_{\max}(\Omega^{(k)})]^{-1} \geq [\|\Omega^{(k)}\|_\infty]^{-1} = 2/(k^2 + k)$. 

Combining (5.101)-(5.102) gives (5.100). 
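
The two facts used in (5.102) are easy to confirm for small $k$: that $\Omega^{(k)}(i,j) = \min\{i,j\}$ inverts $\Sigma^{(k)}$, and that the largest row sum of $\Omega^{(k)}$ is $k(k+1)/2$. The sketch below does so with integer arithmetic and adds a power-iteration estimate of $\lambda_{\max}(\Omega^{(k)})$. The helper names and the iteration count are illustrative assumptions.

```python
def sigma(k):
    """Sigma^{(k)}(i, j) = 2*1{i=j} - 1{|i-j|=1} - 1{i=j=k}, with 1-based i, j."""
    return [[2 * (i == j) - (abs(i - j) == 1) - (i == j == k)
             for j in range(1, k + 1)] for i in range(1, k + 1)]

def omega(k):
    """Omega^{(k)}(i, j) = min(i, j), with 1-based i, j."""
    return [[min(i, j) for j in range(1, k + 1)] for i in range(1, k + 1)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def inf_norm(M):
    """Matrix infinity norm: largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def lambda_max(M, iters=500):
    """Power-iteration estimate (a Rayleigh quotient) of the top eigenvalue."""
    n = len(M)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        scale = max(abs(v) for v in y)
        x = [v / scale for v in y]
    y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

identity = lambda k: [[int(i == j) for j in range(k)] for i in range(k)]
inverse_ok = all(matmul(sigma(k), omega(k)) == identity(k) for k in range(1, 10))
norm_ok = all(inf_norm(omega(k)) == k * (k + 1) // 2 for k in range(1, 10))

k = 6
lam_min_sigma = 1.0 / lambda_max(omega(k))   # lambda_min(Sigma) = 1 / lambda_max(Omega)
print(inverse_ok, norm_ok, lam_min_sigma, 2 / (k * k + k))
```

For $k = 6$ the estimated $\lambda_{\min}(\Sigma^{(k)})$ should lie above $2/(k^2+k)$, in line with (5.102).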

Finally, consider (c). Fix $1 \leq j \leq p$ and the triplet $(\mathcal{I}^{(k)}, V_0, V_1)$, where 
$|\mathcal{I}^{(k)}| \leq l_0$. The goal is to show 

(5.103) $\rho_j(V_0, V_1; \mathcal{I}^{(k)}) \geq \rho^*_{cp} + o(1)$. 

Introduce the following quantities: From the PE-step and the choice $\ell^{pe} = 
2\log(p)$, we can write 

$\mathcal{I}^{(k),pe} = \{j_1 + 1, \ldots, j_1 + L\}$ and $\mathcal{I}^{(k)} = \{j_1 + M_1, \ldots, j_1 + M_2\}$, 

where the integers $L$, $M_1$ and $M_2$ satisfy 
(5.104) 

$M_2 - M_1 \leq l_0 + 1$, $M_1 \geq [\log(p)]^{1/(l_0+1)}$, $(L - M_2)/M_1 \geq [\log(p)]^{1/(l_0+1)}$. 

Denote $K = M_2 - M_1 + 1$, $M_0 = M_1 - \frac{M_1^2}{L+1}$ and $\mathcal{I}'' = \{M_0, \ldots, M_0 + K - 1\}$. 
Let $\mathcal{F}$ be the one-to-one mapping from $\mathcal{I}^{(k)}$ to $\mathcal{I}''$ such that $\mathcal{F}(i) = i - (j_1 + 
M_1) + M_0$. Denote $V_0'' = \mathcal{F}(V_0)$ and $V_1'' = \mathcal{F}(V_1)$. Recall the definitions of 
$\varpi_j(V_0, V_1; \mathcal{I}^{(k)})$ and $\varpi^*(V_0, V_1)$ (see (2.59) and (1.25)). We claim that 

(5.105) $\varpi_j(V_0, V_1; \mathcal{I}^{(k)}) \geq \varpi^*(V_0'', V_1'') + o(1)$. 

Once we have (5.105), plug it into the definition $\rho_j(V_0, V_1; \mathcal{I}^{(k)})$ and use 
the monotonicity of the function $f(x) = [(x - a/x)_+]^2$ over $(0, \infty)$ when 
$a > 0$. It follows that 

$\rho_j(V_0, V_1; \mathcal{I}^{(k)}) \geq \max\{|V_0|, |V_1|\}\vartheta + \frac{1}{4}\Big[\Big(\sqrt{\varpi^* r} - \frac{||V_1| - |V_0||\vartheta}{\sqrt{\varpi^* r}}\Big)_+\Big]^2 + o(1)$, 

where $\varpi^*$ is short for $\varpi^*(V_0'', V_1'')$. Compare the first term on the right hand 
side with (1.26) and recall that $|V_0''| = |V_0|$ and $|V_1''| = |V_1|$. It follows that 

(5.106) $\rho_j(V_0, V_1; \mathcal{I}^{(k)}) \geq \rho(V_0'', V_1'') + o(1)$. 

Moreover, since $M_0 = \min(V_0'' \cup V_1'')$, by (5.94), 

(5.107) $\rho(V_0'', V_1'') \geq \tilde\rho^*_{M_0}$. 

Note that (5.104) implies $M_0 \gtrsim M_1 \geq [\log(p)]^{1/(1+l_0)}$. By a trivial extension 
of Lemma 5.6, we can derive that $\max_{(\log(p))^{1/(1+l_0)} \leq j \leq p - (\log(p))^{1/(1+l_0)}} |\tilde\rho_j^* - 
\rho^*_{cp}| = o(1)$. These together imply 

(5.108) $\tilde\rho^*_{M_0} = \rho^*_{cp} + o(1)$. 

Combining (5.106)-(5.108) gives (5.103). 

What remains is to show (5.105). The proof is similar to that of (5.152). 
In detail, write for short wj = m j (V Q ,V 1 ;X( k )), w* = m*(yg ,V{'), B 1 = 

Bl (k),pe^ k) ^ ^ = fflW , P e )l(fe ),pe ^ ^ = ^-1 ^ g y arguments 

in (5.153), Wj > min Jg iroj, and there exists a constant a\ > such that 
I minro,- -w*\< max \^'(G X "' X " - Qi)f I < C\\G X "> X " - Qi\\. 

Therefore, it suffices to show that 

(5.109) $\|G^{\mathcal{I}'',\mathcal{I}''} - Q_1\| = o(1)$. 

Note that $Q_1$ is the $(\mathcal{I}', \mathcal{I}')$-block of $H_1^{-1}$, where the index set $\mathcal{I}' = \{M_1, \ldots, M_2\}$. 
By (5.99), $H_1^{-1} = \Omega^{(L)} - \frac{1}{L+1}\eta\eta'$, where $\eta = (1, 2, \ldots, L)'$. It follows that 

$Q_1 = (M_1 - 1) 1_K 1_K' + \Omega^{(K)} - \frac{1}{L+1}\xi\xi'$, 

where $1_K$ is the $K$-dimensional vector whose elements are all equal to 1, 
and $\xi = (M_1, \ldots, M_2)'$. Define the $L \times L$ matrix $\Delta$ by $\Delta(i,j) = \frac{ij - M_1^2}{L+1}$, for 
$1 \leq i, j \leq L$ and let $\Delta_1$ be the submatrix of $\Delta$ by restricting the rows and 
columns to $\mathcal{I}'$. By these notations, 

$Q_1 = (M_0 - 1) 1_K 1_K' + \Omega^{(K)} - \Delta_1$. 

At the same time, we observe that 

$G^{\mathcal{I}'',\mathcal{I}''} = (M_0 - 1) 1_K 1_K' + \Omega^{(K)}$. 

Combining the above yields that $G^{\mathcal{I}'',\mathcal{I}''} - Q_1 = \Delta_1$. Note that $|\Delta(i,j)| \leq 
\frac{M_2^2 - M_1^2}{L+1} \leq \frac{(l_0+1)(2M_1 + l_0 + 1)}{L+1} = o(1)$ for all $i, j \in \mathcal{I}'$. Hence, $\|\Delta_1\| = o(1)$ and 

(5.109) follows directly. □ 

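5.6.1. Proof of Lemma 5.6. To show the claim, we need to introduce 
some quantities and lemmas. First, by a trivial extension of Lemma 5.3, 

$\tilde\rho_j^*(\vartheta, r, G) = \min_{(F,N): \min(F \cup N) = j, F \cap N = \emptyset, F \neq \emptyset} \psi(F,N)$, 

where $\psi(F,N) = \psi(F,N;\vartheta,r,G)$, defined in (1.33). 

Second, let $\mathcal{R}_p$ denote the collection of all subsets of $\{1, \ldots, p\}$ that are 
formed by consecutive nodes. Define 

$\bar\rho_j^*(\vartheta, r, G) = \min_{(F,N): \min(F \cup N) = j, F \cap N = \emptyset, F \neq \emptyset, F \cup N \in \mathcal{R}_p, F \in \mathcal{R}_p, |F| \leq 3, |N| \leq 2} \psi(F,N)$, 

where we emphasize that the minimum is taken over finite pairs $(F,N)$. The 
following lemma is proved in Appendix A. 

Lemma 5.7. As $p \to \infty$, $\max_{\log(p) \leq j \leq p - \log(p)} |\tilde\rho_j^*(\vartheta, r, G) - \bar\rho_j^*(\vartheta, r, G)| = 
o(1)$. 

Third, for each dimension $k$, define the $k \times k$ matrix $\Sigma_*^{(k)}$ as 

(5.110) $\Sigma_*^{(k)}(i,j) = 2 \cdot 1\{i = j\} - 1\{|i - j| = 1\}$, 

except that $\Sigma_*^{(k)}(1,1) = \Sigma_*^{(k)}(k,k) = 1$, and the $k \times k$ matrix $\Omega_*^{(k)}$ as 

(5.111) $\Omega_*^{(k)}(i,j) = \min\{i,j\} - 1$. 
Let, with $k = |F| + |N|$, 

$\omega^{(\infty)}(F,N) = \min_{\xi \in \mathbb{R}^{|F|}: |\xi_i| \geq 1} \xi'[(\Sigma_*^{(k)})^{F,F}]^{-1}\xi$, if $|N| > 0$; 
$\omega^{(\infty)}(F,N) = \min_{\xi \in \mathbb{R}^{|F|}: |\xi_i| \geq 1, 1'\xi = 0} \xi'\Omega_*^{(k)}\xi$, if $|N| = 0$, $|F| > 1$; 
$\omega^{(\infty)}(F,N) = \infty$, if $|N| = 0$, $|F| = 1$; 

and define $\psi^{(\infty)}(F,N) = \psi^{(\infty)}(F,N;\vartheta,r,G)$, a counterpart of $\psi(F,N)$, by 
replacing $\omega(F,N)$ by $\omega^{(\infty)}(F,N)$ in the definition (1.33). Let 

$\rho^{(\infty)}(\vartheta, r) = \min_{(F,N): \min(F \cup N) = 1, F \cap N = \emptyset, F \neq \emptyset, F \in \mathcal{R}_p, F \cup N \in \mathcal{R}_p, |F| \leq 3, |N| \leq 2} \psi^{(\infty)}(F,N)$, 

where we note that $\rho^{(\infty)}(\vartheta, r)$ does not depend on $j$. The following lemma 
is proved in Appendix A. 

Lemma 5.8. As $p \to \infty$, $\max_{\log(p) \leq j \leq p - \log(p)} |\bar\rho_j^*(\vartheta, r, G) - \rho^{(\infty)}(\vartheta, r)| = 
o(1)$. 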

Now, we show the claims. Write for short $\tilde\rho_j^* = \tilde\rho_j^*(\vartheta, r, G)$, and $\bar\rho_j^*$, $\rho^*_{cp}$ 
similarly. First, we show 

$d_p(\mathcal{G}^{\nabla}) \leq L_p$. 
Denote $(F_j^*, N_j^*)$ the minimum in defining $\bar\rho_j^*$, and if there is a tie, we pick 
the one that appears first lexicographically. By definition and Lemma 5.7, 
for any $\log(p) \leq j \leq p - \log(p)$, 

$\psi(F_j^*, N_j^*) = \bar\rho_j^* = \tilde\rho_j^* + o(1)$ and $(F_j^* \cup N_j^*) \subset \{j, \ldots, j+4\}$. 

By the definition of $\mathcal{G}^{\nabla}$, these imply that there is an edge between nodes $j$ 
and $k$ only when $|k - j| \leq 4$. So $d_p(\mathcal{G}^{\nabla}) \leq C$. 
Next, we show for all $\log(p) \leq j \leq p - \log(p)$, 

$\tilde\rho_j^* = \rho^*_{cp} + o(1)$. 
By Lemma 5.7 and Lemma 5.8, it suffices to show 

(5.112) $\rho^{(\infty)} = \rho^*_{cp}$. 

Introduce the function $\nu(\cdot; F, N)$ for each $(F,N)$: 

$\nu(x; F, N) = (|F| + 2|N|)/2 + \omega^{(\infty)} x/4$, if $|F|$ is even; 
$\nu(x; F, N) = (|F| + 2|N| + 1)/2 + [(\sqrt{\omega^{(\infty)} x} - 1/\sqrt{\omega^{(\infty)} x})_+]^2/4$, if $|F|$ is odd, 

where $\omega^{(\infty)}$ is short for $\omega^{(\infty)}(F,N)$. Then we can write 

$\psi^{(\infty)}(F,N;\vartheta,r,G) = \vartheta \cdot \nu(r/\vartheta; F, N)$. 

Let $\nu^*(x) = \min_{(F,N)} \nu(x; F, N)$, where the minimum is taken over those 
$(F,N)$ in defining $\rho^{(\infty)}$. It follows that 

(5.113) $\rho^{(\infty)}(\vartheta, r) = \min_{(F,N)} \vartheta \cdot \nu(r/\vartheta; F, N) = \vartheta \cdot \nu^*(r/\vartheta)$. 

Below, we compute the function $\nu^*(\cdot)$ by computing the functions $\nu(\cdot; F, N)$ 
for the finite pairs $(F,N)$ in defining $\rho^{(\infty)}$. After excluding some obviously 
non-optimal pairs, all possible cases are displayed in Table 5. Using Table 
5, we can further exclude the cases with $|F| = 3$. In the remaining, for each 
fixed value of $\omega^{(\infty)}$, we keep two pairs of $(F,N)$ which minimize $|F| + 2|N|$ 

among those with $|F|$ odd and even respectively. The results are displayed in 
Table 6. Then $\nu^*(\cdot)$ is the lower envelope of the four functions listed. Direct 
calculations yield 

$\nu^*(x) = 1 + x/4$, if $0 < x \leq 6 + 2\sqrt{10}$; 
$\nu^*(x) = 3 + (\sqrt{x} - 2/\sqrt{x})^2/8$, if $x > 6 + 2\sqrt{10}$. 

Plugging this into (5.113) and comparing it with the definition of $\rho^*_{cp}$, we 
obtain (5.112). □ 
Table 5 
Calculation of $\omega^{(\infty)}(F, N)$ 

F          N          $(\Sigma_*^{(k)})^{F,F}$            $\Omega_*^{(k)}$             $\xi^*$           $\omega^{(\infty)}(F,N)$ 
{1}        {2}        1                                   -                            1                 1 
{2}        {1,3}      2                                   -                            1                 1/2 
{1,2}      $\emptyset$  -                                 [0 0; 0 1]                   (1,-1)'           1 
{1,2}      {3}        [1 -1; -1 2]                        -                            (1,-1)'           1 
{2,3}      {1,4}      [2 -1; -1 2]                        -                            (1,-1)'           2/3 
{1,2,3}    $\emptyset$  -                                 [0 0 0; 0 1 1; 0 1 2]        (1,-2,1)'         2 
{1,2,3}    {4}        [1 -1 0; -1 2 -1; 0 -1 2]           -                            (1,-3/2,1)'       3/2 
{2,3,4}    {1,5}      [2 -1 0; -1 2 -1; 0 -1 2]           -                            (1,-1,1)'         1 

Table 6 
Calculation of $\nu(x; F, N)$ 

$\omega^{(\infty)}$    |F|    |N|    $\nu(x; F, N)$ 
1          1      1      $2 + [(\sqrt{x} - 1/\sqrt{x})_+]^2/4$ 
1          2      0      $1 + x/4$ 
1/2        1      2      $3 + [(\sqrt{x} - 2/\sqrt{x})_+]^2/8$ 
2/3        2      2      $3 + x/6$ 
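
The lower envelope $\nu^*$ can be checked numerically. In the sketch below, the four candidate triples $(|F|, |N|, \omega^{(\infty)})$ are read off from the cases with $|F| \leq 2$ in Table 5, which is an assumption about how the four functions were selected; the function names are hypothetical.

```python
import math

def nu(x, size_F, size_N, omega):
    """nu(x; F, N) as defined above, in terms of |F|, |N| and omega^(inf)(F, N)."""
    base = size_F + 2 * size_N
    if size_F % 2 == 0:
        return base / 2 + omega * x / 4
    root = math.sqrt(omega * x)
    return (base + 1) / 2 + max(root - 1 / root, 0.0) ** 2 / 4

# Assumed candidates (|F|, |N|, omega^(inf)), taken from the |F| <= 2 rows of Table 5.
CANDIDATES = [(1, 1, 1.0), (2, 0, 1.0), (1, 2, 0.5), (2, 2, 2 / 3)]

def nu_star(x):
    """Lower envelope of the candidate functions."""
    return min(nu(x, *c) for c in CANDIDATES)

def nu_star_closed_form(x):
    """The two-branch formula stated for nu^*."""
    x0 = 6 + 2 * math.sqrt(10)
    if x <= x0:
        return 1 + x / 4
    return 3 + (math.sqrt(x) - 2 / math.sqrt(x)) ** 2 / 8

grid = [0.05 * i for i in range(1, 1001)]   # x in (0, 50]
max_gap = max(abs(nu_star(x) - nu_star_closed_form(x)) for x in grid)
print(max_gap)
```

The switch point $6 + 2\sqrt{10}$ is the positive root of $x^2 - 12x - 4 = 0$, which is where the two branches $1 + x/4$ and $3 + (\sqrt{x} - 2/\sqrt{x})^2/8$ coincide.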



5.7. Proof of Lemma 2.1. Fix $\vartheta$ and $r$. Write for short $\rho_j^* = \rho_j^*(\vartheta, r, G)$. 
To show the claim, it suffices to show for each $1 \leq j \leq p$, 

(5.114) $P(\beta_j \neq 0, j \notin \mathcal{U}_p^*) \leq L_p[p^{-\rho_j^*} + p^{-(m+1)\vartheta}] + o(1/p)$. 

Fix $1 \leq j \leq p$. Recall that $\mathcal{G}^*_S$ is the subgraph of $\mathcal{G}^*$ by restricting the 
nodes into $S(\beta)$. Over the event $\{\beta_j \neq 0\}$, there is a unique component $\mathcal{I}$ 
such that $j \in \mathcal{I} \unlhd \mathcal{G}^*_S$. By [14, 20], $|\mathcal{I}| \leq m$ except for a probability of at 
most $L_p p^{-(m+1)\vartheta}$, where the randomness comes from the law of $\beta$. Denote 
this event as $A_p = A_{p,j}$. To show (5.114), it suffices to show 

(5.115) $P(\beta_j \neq 0, j \notin \mathcal{U}_p^*, A_p) \leq L_p p^{-\rho_j^*} + o(1/p)$. 

Note that $\mathcal{I}$ depends on $\beta$ (and so is random), and also that over the event 
$A_p$, any realization of $\mathcal{I}$ is a connected subgraph in $\mathcal{G}^*$ with size $\leq m$. 
Therefore, 

$P(\beta_j \neq 0, j \notin \mathcal{U}_p^*, A_p) \leq \sum_{\mathcal{I}: j \in \mathcal{I} \unlhd \mathcal{G}^*, |\mathcal{I}| \leq m} P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, j \notin \mathcal{U}_p^*, A_p)$, 

where on the right hand side, we have misused the notation slightly by 
denoting $\mathcal{I}$ as a fixed (non-random) connected subgraph of $\mathcal{G}^*$. Since $\mathcal{G}^*$ is 
$K_p$-sparse (see Lemma 5.1), for any fixed $j$, there are no more than $C(eK_p)^m$ 
connected subgraphs $\mathcal{I}$ such that $j \in \mathcal{I}$ and $|\mathcal{I}| \leq m$ [14]. Noticing that 
$C(eK_p)^m \leq L_p$, to show (5.115), it is sufficient to show for any fixed $\mathcal{I}$ such 
that $j \in \mathcal{I} \unlhd \mathcal{G}^*$ and $|\mathcal{I}| \leq m$, 

(5.116) $P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, j \notin \mathcal{U}_p^*, A_p) \leq L_p p^{-\rho_j^*} + o(1/p)$. 

Fix such an $\mathcal{I}$. The subgraph (as a whole) has been screened in some sub-stage 
of the PS-step, say, sub-stage $t$. Let $\hat{N} = \mathcal{U}_p^{(t-1)} \cap \mathcal{I}$ and $\hat{F} = \mathcal{I}\setminus\hat{N}$ be 
as in the initial sub-step of the PS-step. By definitions, the event $\{j \notin \mathcal{U}_p^*\}$ is 
contained in the event that $\mathcal{I}$ fails to pass the $\chi^2$-test in (1.17). As a result, 

$P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, j \notin \mathcal{U}_p^*, A_p) \leq P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, T(d, \hat{F}, \hat{N}) \leq 2q(\hat{F}, \hat{N})\log(p), A_p)$ 

$\leq \sum_{(F,N): F \cup N = \mathcal{I}, F \cap N = \emptyset, F \neq \emptyset} P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, T(d, F, N) \leq 2q(F,N)\log(p), A_p)$, 

where $(F,N)$ are fixed (non-random) subsets, and $q = q(F,N)$ is either as 
in (1.31) or in (1.36). Since $|\mathcal{I}| \leq m$, the summation in the second line only 
involves at most finite terms. Therefore, to show (5.116), it suffices to show 
for each fixed triplet $(\mathcal{I}, F, N)$ satisfying $j \in \mathcal{I} \unlhd \mathcal{G}^*$, $|\mathcal{I}| \leq m$, $F \cup N = \mathcal{I}$, 
$F \cap N = \emptyset$ and $F \neq \emptyset$, 

(5.117) $P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, T(d, F, N) \leq 2q(F,N)\log(p), A_p) \leq L_p p^{-\rho_j^*} + o(1/p)$. 

Now, we show (5.117). The following lemma is proved below. 

Lemma 5.9. For each fixed $(\mathcal{I}, F, N)$ such that $\mathcal{I} = F \cup N$, $F \cap N = \emptyset$, 
$F \neq \emptyset$ and $|\mathcal{I}| \leq m$, there exists a random variable $T_0$ such that with probability 
at least $1 - o(1/p)$, $|T(d, F, N) - T_0| \leq C(\log(p))^{1/\alpha}$, and conditioning on 
$\beta^{\mathcal{I}}$, $T_0$ has a non-central $\chi^2$-distribution with the degree of freedom $k \leq |\mathcal{I}|$ 
and the non-centrality parameter 

$\delta_0 = (\beta^F)'[Q^{F,F} - Q^{F,N}(Q^{N,N})^{-1}Q^{N,F}]\beta^F$, 

where $Q$ is as defined in (1.15). 
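Fix a triplet $(\mathcal{I}, F, N)$ and let $\delta_0$ be as in Lemma 5.9. Then 

(5.118) $P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, T(d, F, N) \leq 2q(F,N)\log(p), A_p)$ 

$\leq P(\mathcal{I} \unlhd \mathcal{G}^*_S, T_0 \leq 2q(F,N)\log(p) + C(\log(p))^{1/\alpha}) + o(1/p)$ 

$\leq P(\mathcal{I} \unlhd \mathcal{G}^*_S) \cdot P(T_0 \leq 2q(F,N)\log(p) + C(\log(p))^{1/\alpha} \mid \mathcal{I} \unlhd \mathcal{G}^*_S) + o(1/p)$. 

Denote $\omega_0 = \tau_p^{-2}\delta_0$. By Lemma 5.9, $(T_0 \mid \beta^{\mathcal{I}}) \sim \chi^2_k(2r\omega_0\log(p))$, where $k \leq 
m$. In addition, $(\log(p))^{1/\alpha} \ll \log(p)$ by recalling that $\alpha > 1$. Combining 
these and using the basic property of non-central $\chi^2$-distributions, 

$P(T_0 \leq 2q(F,N)\log(p) + C(\log(p))^{1/\alpha} \mid \beta^{\mathcal{I}}) \leq L_p p^{-[(\sqrt{\omega_0 r} - \sqrt{q(F,N)})_+]^2}$. 

Inserting this into (5.118) and noting that $P(\mathcal{I} \unlhd \mathcal{G}^*_S) \leq L_p p^{-|\mathcal{I}|\vartheta}$, we have 

$P(j \in \mathcal{I} \unlhd \mathcal{G}^*_S, T(d, F, N) \leq 2q(F,N)\log(p), A_p) \leq L_p p^{-|\mathcal{I}|\vartheta - [(\sqrt{\omega_0 r} - \sqrt{q(F,N)})_+]^2} + o(1/p)$. 

Comparing this with (5.117) and using the expression of $\rho_j^*$ in Lemma 5.3, 
it suffices to show 

(5.119) $|\mathcal{I}|\vartheta + [(\sqrt{\omega_0 r} - \sqrt{q(F,N)})_+]^2 \geq \psi(F,N)$. 

Recall that $q = q(F,N)$ is chosen from either (1.31) or (1.36). In the 
former case, since $\omega_0 \geq \omega(F,N)$ by definition (see (1.35)), it follows immediately 
from (1.31) and (1.32) that (5.119) holds. Therefore, we only consider 
the latter, in which case $q(F,N) = \tilde{q}|F|$ and (5.119) reduces to 

(5.120) $|\mathcal{I}|\vartheta + [(\sqrt{\omega_0 r} - \sqrt{\tilde{q}|F|})_+]^2 \geq \psi(F,N)$. 
By the expression of $\psi(F,N)$, 

$\psi(F,N) \leq (|\mathcal{I}| - |F|/2)\vartheta + (\omega r/4 + \vartheta/2) \leq |\mathcal{I}|\vartheta + \omega r/4$, 

where $\omega$ is a shorthand of $\omega(F,N)$. Therefore, to show (5.120), it suffices to 
check 

(5.121) $\sqrt{\omega_0 r} - \sqrt{\tilde{q}|F|} \geq \frac{1}{2}\sqrt{\omega r}$. 

Towards this end, recalling that $F \subset \mathcal{I}$, we let $\Sigma$ and $\tilde\Sigma$ be the respective 
submatrices of $(G^{\mathcal{I},\mathcal{I}})^{-1}$ and $Q^{-1}$ formed by restricting the rows and columns 
from $\mathcal{I}$ to $F$. Let $\xi^* = \tau_p^{-1}\beta^F$. By elementary calculation and noting that 

$a > a^*_g(G)$, 

$\omega = \min_{\xi \in \mathbb{R}^{|F|}: 1 \leq |\xi_i| \leq a} \xi'\Sigma^{-1}\xi$, $\omega_0 = (\xi^*)'\tilde\Sigma^{-1}\xi^*$. 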
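On one hand, since $G \in \mathcal{M}^*_p(\gamma, g, c_0, A_1)$ and $|\mathcal{I}| \leq m \leq g$, 

(5.122) $\omega \geq |F| \cdot \lambda_{\min}(G^{\mathcal{I},\mathcal{I}}) \geq c_0 \cdot |F|$. 
On the other hand, noting that $\|\xi^*\|_\infty \leq a$, 

(5.123) $|\omega - \omega_0| \leq \max_{\xi \in \mathbb{R}^{|F|}: 1 \leq |\xi_i| \leq a} |\xi'(\Sigma^{-1} - \tilde\Sigma^{-1})\xi| \leq (a^2 \cdot \|\Sigma^{-1} - \tilde\Sigma^{-1}\|)|F|$. 

We argue that $\|\Sigma^{-1} - \tilde\Sigma^{-1}\|$ can be taken to be sufficiently small by taking $\ell^{ps}$ 
sufficiently large. To see the point, note that $\|\Sigma^{-1} - \tilde\Sigma^{-1}\| \leq \|G^{\mathcal{I},\mathcal{I}}\|^2 \|Q^{-1}\|^2 \|G^{\mathcal{I},\mathcal{I}} - 
Q\|$. First, since $|\mathcal{I}| \leq m$, $\|G^{\mathcal{I},\mathcal{I}}\|^2 \leq C$. Second, note that $Q$ is the Fisher Information 
Matrix associated with the model $d^{\mathcal{I}^{ps}} \sim N(B^{\mathcal{I}^{ps},\mathcal{I}}\beta^{\mathcal{I}}, H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})$. 
Using Lemma 1.2 and (5.143), $\|G^{\mathcal{I},\mathcal{I}} - Q\| \leq C(\ell^{ps})^{-\gamma}$. Third, $\|(G^{\mathcal{I},\mathcal{I}})^{-1}\| \leq 
c_0^{-1}$, since $G \in \mathcal{M}^*_p(\gamma, g, c_0, A_1)$. Finally, $\|Q^{-1}\| \leq 2c_0^{-1}$ when $G^{\mathcal{I},\mathcal{I}}$ and $Q$ 

are sufficiently close. Combining these gives that $\|\Sigma^{-1} - \tilde\Sigma^{-1}\| \leq C(\ell^{ps})^{-\gamma}$, 

for sufficiently large $\ell^{ps}$, and the claim follows. 

As a result, by taking $\ell^{ps}$ a sufficiently large constant integer, we have 

(5.124) $a^2 \|\Sigma^{-1} - \tilde\Sigma^{-1}\| \leq (\frac{1}{2}\sqrt{c_0} - \sqrt{\tilde{q}/r})^2$, 

where we note the right hand side is a fixed positive constant. Combining 
(5.122)-(5.124), 

$\sqrt{(\omega - \omega_0)_+} \leq (\frac{1}{2}\sqrt{c_0} - \sqrt{\tilde{q}/r})\sqrt{|F|} \leq \frac{1}{2}\sqrt{\omega} - \sqrt{\tilde{q}|F|/r}$, 

where the first inequality follows from (5.123) and (5.124), as well as the 
fact that $\tilde{q} < c_0 r/4$ (so that $\frac{1}{2}\sqrt{c_0} - \sqrt{\tilde{q}/r} > 0$); and the last inequality 
follows from (5.122). Combining this with the well known inequality that $\sqrt{a} + 
\sqrt{(b-a)_+} \geq \sqrt{b}$ for any $a, b \geq 0$, we have 

$\sqrt{\omega_0} \geq \sqrt{\omega} - \sqrt{(\omega - \omega_0)_+} \geq \sqrt{\omega} - (\frac{1}{2}\sqrt{\omega} - \sqrt{\tilde{q}|F|/r}) = \frac{1}{2}\sqrt{\omega} + \sqrt{\tilde{q}|F|/r}$, 

and (5.121) follows directly. □ 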

5.7.1. Proof of Lemma 5.9. Recall that $T(d, F, N) = W'Q^{-1}W - W_N'(Q_{N,N})^{-1}W_N$, 
where $d = DY$, $W$ and $Q$ are defined in (1.15) which depend on $\mathcal{I} = F \cup N$, 
and $W_N$ and $Q_{N,N}$ are defined in (1.16). Let $V = S(\beta)\setminus\mathcal{I}$. By definitions, 

$W = Q\beta^{\mathcal{I}} + \xi + u$, where $\xi = (B^{\mathcal{I}^{ps},\mathcal{I}})'(H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1}B^{\mathcal{I}^{ps},V}\beta^V$ and $u \sim N(0, Q)$. 

Denote $W_0 = Q\beta^{\mathcal{I}} + u$. Introduce a proxy of $T(d, F, N)$ by 

$T_0(d, F, N) = W_0'Q^{-1}W_0 - (W_0^N)'(Q^{N,N})^{-1}W_0^N$. 
Write for short $T = T(d, F, N)$ and $T_0 = T_0(d, F, N)$. To show the claim, 
it is sufficient to show (a) $|T - T_0| \leq C(\log(p))^{1/\alpha}$ with probability at least 

$1 - o(1/p)$ and (b) $(T_0 \mid \beta^{\mathcal{I}}) \sim \chi^2_k(\delta_0)$. 

Consider (a) first. By direct calculations, 

(5.125) $|T - T_0| \leq 2\|\xi\| \cdot (2\|\beta^{\mathcal{I}}\| + \|Q^{-1}\|\|\xi\| + 2\|Q^{-1/2}\|\|Q^{-1/2}u\|)$. 

First, since $|\mathcal{I}| \leq m$ and $\|\beta\|_\infty \leq a\tau_p \leq C\sqrt{\log(p)}$, $\|\beta^{\mathcal{I}}\| \leq C\sqrt{\log(p)}$. Second, 
by definitions, $\max\{\|Q^{-1/2}\|, \|Q^{-1}\|\} \leq C$. Last, note that $Q^{-1/2}u \sim 
N(0, I_{|\mathcal{I}|})$ and so with probability at least $1 - o(1/p)$, $\|Q^{-1/2}u\| \leq C\sqrt{\log(p)}$. 
Inserting these into (5.125), we have that with probability at least $1 - o(1/p)$, 

$|T - T_0| \leq C\|\xi\|(\sqrt{\log(p)} + \|\xi\|)$. 

We now study $\|\xi\|$. By definitions, it is seen that 

$\|\xi\| \leq \|B^{\mathcal{I}^{ps},\mathcal{I}}\| \cdot \|(H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1}\| \cdot \|B^{\mathcal{I}^{ps},V}\beta^V\|$. 

First, we have $\|B^{\mathcal{I}^{ps},\mathcal{I}}\| \leq \|B\| \leq C$. Second, since $|\mathcal{I}^{ps}| \leq C$, by RCB, 
$\lambda_{\min}(H^{\mathcal{I}^{ps},\mathcal{I}^{ps}}) \geq c_1|\mathcal{I}^{ps}|^{-\kappa} \geq C > 0$, and so $\|(H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1}\| \leq C$. Third, by 
basic algebra, 

(5.126) $\|B^{\mathcal{I}^{ps},V}\beta^V\| \leq \sqrt{|\mathcal{I}^{ps}|} \cdot \|B^{\mathcal{I}^{ps},V}\beta^V\|_\infty \leq C\|B^{\mathcal{I}^{ps},V}\|_\infty \cdot \|\beta^V\|_\infty$. 

Here, we note that $\|B^{\mathcal{I}^{ps},V}\|_\infty \leq \|B - B^{**}\|_\infty$, where $B^{**}$ is defined in 
Section 5.1, and where by Lemma 5.1, $\|B - B^{**}\|_\infty \leq C(\log(p))^{-(1-1/\alpha)}$. 
As a result, $\|B^{\mathcal{I}^{ps},V}\|_\infty \leq C(\log(p))^{-(1-1/\alpha)}$. Inserting this into (5.126) and 
recalling that $\|\beta^V\|_\infty \leq C\sqrt{\log(p)}$, 

$\|B^{\mathcal{I}^{ps},V}\beta^V\| \leq C(\log(p))^{-(1-1/\alpha)} \cdot \sqrt{\log(p)} = C(\log(p))^{1/\alpha - 1/2}$. 

Combining these gives that $\|\xi\| \leq C(\log(p))^{1/\alpha - 1/2}$. This, together with 
(5.125), implies that 

$|T - T_0| \leq C(\log(p))^{1/\alpha - 1/2}[\sqrt{\log(p)} + (\log(p))^{1/\alpha - 1/2}] \leq C[(\log(p))^{1/\alpha} + (\log(p))^{2/\alpha - 1}]$, 

and the claim follows by recalling $\alpha > 1$. 

Next, consider (b). Write for short $R = (H^{\mathcal{I}^{ps},\mathcal{I}^{ps}})^{-1/2}B^{\mathcal{I}^{ps},\mathcal{I}}$. Also, recall 
that $F$ and $N$ are subsets of $\mathcal{I}$. We let $R_F$ and $R_N$ be the submatrices of $R$ 
by restricting the columns to $F$ and $N$, respectively (no restriction on the 
rows). By definitions, $Q = R'R$ and $u \sim N(0, Q)$, so that we can rewrite 
$u = R'\tilde{z}$ for some random vector $\tilde{z} \sim N(0, I_{|\mathcal{I}^{ps}|})$. With these notations, we 
can rewrite $T_0$ as 

$T_0 = (R\beta^{\mathcal{I}} + \tilde{z})'[R(R'R)^{-1}R' - R_N(R_N'R_N)^{-1}R_N'](R\beta^{\mathcal{I}} + \tilde{z})$. 
Therefore, $(T_0 \mid \beta^{\mathcal{I}}) \sim \chi^2_k(\tilde\delta_0)$ [21], where $k = \mathrm{rank}(R) - \mathrm{rank}(R_N) \leq |\mathcal{I}|$, and 

$\tilde\delta_0 = (R\beta^{\mathcal{I}})'[R(R'R)^{-1}R' - R_N(R_N'R_N)^{-1}R_N'](R\beta^{\mathcal{I}})$. 

By basic algebra, $\tilde\delta_0 = \delta_0$. This completes the proof. □ 
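
The "basic algebra" behind $\tilde\delta_0 = \delta_0$ can be traced in the smallest case $|F| = |N| = 1$, where all inverses are scalar. The sketch below uses exact fractions; the columns `R_F`, `R_N` and the coefficients `beta_F`, `beta_N` are hypothetical numbers chosen only to exercise the identity.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical columns of R (three rows, one column each for F and N).
R_F = [Fraction(1), Fraction(2), Fraction(0)]
R_N = [Fraction(1), Fraction(-1), Fraction(3)]
beta_F, beta_N = Fraction(3, 2), Fraction(-2)

# Q = R'R, written out entry by entry.
Q_FF, Q_FN, Q_NN = dot(R_F, R_F), dot(R_F, R_N), dot(R_N, R_N)

# delta_0 = beta_F' [Q_FF - Q_FN Q_NN^{-1} Q_NF] beta_F  (scalar case).
delta_0 = beta_F * (Q_FF - Q_FN * Q_FN / Q_NN) * beta_F

# R beta lies in the column space of R, so (R beta)' P_R (R beta) = ||R beta||^2,
# and the projection onto span(R_N) removes (R_N' R beta)^2 / ||R_N||^2.
mean = [beta_F * a + beta_N * b for a, b in zip(R_F, R_N)]
delta_tilde = dot(mean, mean) - dot(R_N, mean) ** 2 / Q_NN
print(delta_0, delta_tilde)
```

Note that `beta_N` drops out of the result: the component of $R\beta^{\mathcal{I}}$ along $R_N$ is removed by the second projection, which is why $\delta_0$ involves only $\beta^F$.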

5.8. Proof of Lemma 2.2. Viewing $\mathcal{U}_p^*$ as a subgraph of $\mathcal{G}^+$, we recall 
that $\mathcal{I} \unlhd \mathcal{U}_p^*$ stands for that $\mathcal{I}$ is a component of $\mathcal{U}_p^*$. The assertion of Lemma 
2.2 is that there exists a constant integer $l_0$ such that 

(5.127) $P(|\mathcal{I}| > l_0 \text{ for some } \mathcal{I} \unlhd \mathcal{U}_p^*) = o(1/p)$. 

The key to show the claim is the following lemma, which is proved below: 

Lemma 5.10. There is an event $A_p$ and a constant $C_1 > 0$ such that 
$P(A_p^c) = o(1/p)$ and that over the event $A_p$, $\|d^{\mathcal{I}^{ps}}\|^2 \geq 5C_1|\mathcal{I}|\log(p)$ for all 
$\mathcal{I} \unlhd \mathcal{U}_p^*$. 

By Lemma 5.10, to show (5.127), it suffices to show 

(5.128) $P(|\mathcal{I}| > l_0 \text{ for some } \mathcal{I} \unlhd \mathcal{U}_p^*, A_p) = o(1/p)$. 

Now, for each $1 \leq j \leq p$, there is a unique component $\mathcal{I}$ such that $j \in \mathcal{I} \unlhd \mathcal{U}_p^*$. 
Such $\mathcal{I}$ is random, but any of its realization is a connected subgraph of $\mathcal{G}^+$. 
Therefore, 

(5.129) $P(|\mathcal{I}| > l_0 \text{ for some } \mathcal{I} \unlhd \mathcal{U}_p^*, A_p) \leq \sum_{j=1}^{p}\sum_{l=l_0+1}^{\infty}\sum_{\mathcal{I}: j \in \mathcal{I} \unlhd \mathcal{G}^+, |\mathcal{I}| = l} P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*, A_p)$, 

where on the right hand side we have changed the meaning of $\mathcal{I}$ to denote 
a fixed (non-random) connected subgraph of $\mathcal{G}^+$. We argue that 

(a) for each $(j, l)$, the third summation on the right of (5.129) sums over 
no more than $L_p$ terms; 

(b) there are constants $C_2, C_3 > 0$ such that for any $(j, \mathcal{I})$ satisfying $j \in 
\mathcal{I} \unlhd \mathcal{G}^+$, $P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*, A_p) \leq L_p[p^{-C_2\sqrt{|\mathcal{I}|}} + p^{-C_3|\mathcal{I}|}]$. 

Once (a) and (b) are proved, then it follows from (5.129) that 

$P(|\mathcal{I}| > l_0 \text{ for some } \mathcal{I} \unlhd \mathcal{U}_p^*, A_p) \leq L_p[p^{1 - C_2\sqrt{l_0}} + p^{1 - C_3 l_0}]$, 

and (5.128) follows by taking $l_0$ sufficiently large. 

It remains to show (a) and (b). Consider (a) first. Note that the number of 
connected subgraphs $\mathcal{I}$ of size $l$ such that $j \in \mathcal{I} \unlhd \mathcal{G}^+$ is bounded by $C(eK_p^+)^l$ 
[14], where $K_p^+$ is the maximum degree of $\mathcal{G}^+$. At the same time, by Lemma 
5.1 and Lemma 5.2, $K_p^+$ is an $L_p$ term. Combining these gives (a). 

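Consider (b). Denote $V = \{j: B^{**}(i,j) \neq 0, \text{ for some } i \in \mathcal{I}^{ps}\}$, where 
$B^{**}$ is defined in Section 5.1. Write for short $d_1 = d^{\mathcal{I}^{ps}}$, $B_1 = B^{\mathcal{I}^{ps},V}$ and 
$H_1 = H^{\mathcal{I}^{ps},\mathcal{I}^{ps}}$. With these notations and by Lemma 5.10, (b) reduces to 

(5.130) $P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*, \|d_1\|^2 \geq 5C_1|\mathcal{I}|\log(p)) \leq L_p[p^{-C_2\sqrt{|\mathcal{I}|}} + p^{-C_3|\mathcal{I}|}]$. 

We now show (5.130). Note that $d_1 = B_1\beta^V + \xi + \tilde{z}$, where $\xi = [(B - 
B^{**})\beta]^{\mathcal{I}^{ps}}$ and $\tilde{z} \sim N(0, H_1)$. For preparation, we claim that 

(5.131) $\|\xi\|^2 = |\mathcal{I}| \cdot o(\log(p))$. 

In fact, first since $\ell^{ps}$ is finite, $|\mathcal{I}^{ps}| \leq C|\mathcal{I}|$ and it follows that $\|\xi\|^2 \leq 
C|\mathcal{I}| \cdot \|\xi\|_\infty^2$. Second, by Lemma 5.1, $\|B - B^{**}\|_\infty = o(1)$. Since $\|\beta\|_\infty \leq 

a\tau_p \leq C\sqrt{\log(p)}$, it follows that $\|\xi\|_\infty \leq \|B - B^{**}\|_\infty\|\beta\|_\infty = o(\sqrt{\log(p)})$. 

Combining these gives (5.131). 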

Now, combining (5.130) and (5.131) and using the well-known inequality 
$(a + b)^2 \leq 2a^2 + 2b^2$ for $a, b \in \mathbb{R}$, we find that for sufficiently large $p$, 

(5.132) $P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*, \|d_1\|^2 \geq 5C_1|\mathcal{I}|\log(p))$ 

$\leq P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*, \|B_1\beta^V + \tilde{z}\|^2 \geq 4C_1|\mathcal{I}|\log(p))$ 

$\leq P(j \in \mathcal{I} \unlhd \mathcal{U}_p^*, \|B_1\beta^V\|^2 + \|\tilde{z}\|^2 \geq 2C_1|\mathcal{I}|\log(p))$ 

$\leq P(\|B_1\beta^V\|^2 \geq C_1|\mathcal{I}|\log(p)) + P(\|\tilde{z}\|^2 \geq C_1|\mathcal{I}|\log(p)) \equiv I + II$. 

We now analyze $I$ and $II$ separately. Consider $I$ first. We claim there is a 
constant $C_4 > 0$, not depending on $|\mathcal{I}|$, such that $\|B_1\beta^V\|^2 \leq C_4\log(p)\|\beta^V\|_0^2$. 
To see this, note that $\|B_1\beta^V\| \leq \|B_1\beta^V\|_1 \leq \|B_1\|_1\|\beta^V\|_1$, where $\|B_1\|_1 \leq 
\|B\|_1 \leq C$, with $C > 0$ a constant independent of $|\mathcal{I}|$. At the same time, 
$\|\beta^V\|_1 \leq a\tau_p\|\beta^V\|_0$. So the argument holds for $C_4 = 2ra^2C^2$. Additionally, 
$\|\beta^V\|_0$ has a multinomial distribution, where the number of trials is $|V| \leq L_p$ 
and the success probability is $\epsilon_p = p^{-\vartheta}$. Combining these, we have 

(5.133) $I \leq P(\|\beta^V\|_0 \geq \sqrt{(C_1/C_4)|\mathcal{I}|}) \leq L_p p^{-\vartheta\lfloor\sqrt{(C_1/C_4)|\mathcal{I}|}\rfloor}$, 

where $\lfloor x \rfloor$ denotes the largest integer $k$ such that $k \leq x$. 

Next, consider II. Note that ||-Hi|| < \\H\\ < C5, where C5 > is a 
constant independent of \X\. It follows that ||z|| 2 < C^H-ffi 1 ^ 2 z\\ 2 , where 
\\H 1 1 ^ 2 z\\ 2 has a x 2 -distribution with degree of freedom |X ps | < C\T\. Using 
the property of x 2 -distributions, 

(5.134) // < P(\\H7 1/2 z\\ 2 > C X C 5 \T\ log(p)) < L p p-^ C5 WW. 
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Inserting (5.133) and (5.134) into (5.132), (5.130) follows by taking C 2 > 
■dy/Ch/Ck and C 3 > CiC 5 /2. □ 

5.8.1. Proof of Lemma 5.10. For preparation, we need some notation. First, for a constant $\delta_0 > 0$ to be determined, define the $p \times p$ matrices $\tilde B$ and $\tilde H$ by

$\tilde B(i,j) = B(i,j) 1\{|B(i,j)| > \delta_0\}, \qquad \tilde H(i,j) = H(i,j) 1\{|H(i,j)| > \delta_0\}$.
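The hard-thresholding step above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the test matrix with entries $(1 + |i - j|)^{-\alpha}$, the values of `alpha` and `delta0`, and the helper names `hard_threshold`, `polynomial_decay_matrix` and `max_row_nonzeros`. It shows that, for a matrix with polynomial off-diagonal decay, thresholding leaves a number of non-zero entries per row that does not grow with $p$, which is the sparsity property used later in this proof.

```python
def hard_threshold(matrix, delta0):
    """Keep entries with |value| > delta0 and set the rest to zero."""
    return [[v if abs(v) > delta0 else 0.0 for v in row] for row in matrix]


def polynomial_decay_matrix(p, alpha):
    """Hypothetical test matrix with entries (1 + |i - j|)^(-alpha)."""
    return [[(1.0 + abs(i - j)) ** (-alpha) for j in range(p)] for i in range(p)]


def max_row_nonzeros(matrix):
    """Largest number of non-zero entries in any row."""
    return max(sum(1 for v in row if v != 0.0) for row in matrix)


p, alpha, delta0 = 200, 2.0, 0.01
B = polynomial_decay_matrix(p, alpha)
B_thr = hard_threshold(B, delta0)

# An entry survives only if (1 + |i - j|)^(-alpha) > delta0, that is
# |i - j| < delta0^(-1/alpha) - 1, so each row keeps at most
# 2 * delta0^(-1/alpha) entries, whatever the value of p.
row_bound = 2 * delta0 ** (-1.0 / alpha)
nonzeros = max_row_nonzeros(B_thr)

# Largest absolute row sum of the discarded part B - B_thr.
residual_inf_norm = max(
    sum(abs(b - t) for b, t in zip(row_b, row_t))
    for row_b, row_t in zip(B, B_thr)
)
```

In this toy setting the discarded part has a small maximum absolute row sum, in line with the bound of order $\delta_0^{1 - 1/\alpha}$ used in the proof.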

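Second, view $\mathcal{U}_p^*$ as a subgraph of $\mathcal{G}^+$. Note that in the PS-step, each $\mathcal{G}_t$ is a connected subgraph of $\mathcal{G}^+$. Hence, any $\mathcal{G}_t$ that passed the test must be contained as a whole in one component of $\mathcal{U}_p^*$. It follows that for any $\mathcal{I} \lhd \mathcal{U}_p^*$, there exists a (random) set $\mathcal{T} \subset \{1, \cdots, T\}$ such that $\mathcal{I} = \cup_{t \in \mathcal{T}} \mathcal{G}_t$. Therefore, we write

$\mathcal{I} = \cup_{i=1}^{s_0} V_i$,

where each $V_i = \mathcal{G}_t$ for some $t \in \mathcal{T}$, and these $V_i$'s are listed in the order they were tested. Denote $N_i = (\cup_{j < i} V_j) \cap V_i$ and $F_i = V_i \setminus N_i$. Let $W_{(i)}$ and $Q_{(i)}$ be the vector $W$ and matrix $Q$ in (1.15). From basic algebra, the test statistic can be rewritten as

(5.135)  $T(d, F_i, N_i) = \|u_{(i)}\|^2, \qquad u_{(i)} = \Sigma_{(i)}^{-1/2} \, [\, W_{(i)}^{F_i} - Q_{(i)}^{F_i,N_i}(Q_{(i)}^{N_i,N_i})^{-1} W_{(i)}^{N_i} \,]$,

where $\Sigma_{(i)} = Q_{(i)}^{F_i,F_i} - Q_{(i)}^{F_i,N_i}[Q_{(i)}^{N_i,N_i}]^{-1}Q_{(i)}^{N_i,F_i}$.

Third, define

$W^*_{(i)} = (\tilde B^{V_i^{ps}, V_i})' (\tilde H^{V_i^{ps}, V_i^{ps}})^{-1} d^{V_i^{ps}}$,

and $u^*_{(i)}$ as in (5.135) with $W_{(i)}$ replaced by $W^*_{(i)}$. Let $u$ be the $|\mathcal{I}| \times 1$ vector obtained by putting $\{u_{(i)}, 1 \le i \le s_0\}$ together, and define $u^*$ similarly.

With these notations, to show the claim, it suffices to show there exist positive constants $C_6, C_7$ such that with probability at least $1 - o(1/p)$, for any $\mathcal{I} \lhd \mathcal{U}_p^*$,

(5.136)  $\|u^*\|^2 \ge C_6|\mathcal{I}|\log(p)$,

and

(5.137)  $\|u^*\|^2 \le C_7\|d^{\mathcal{I}^{ps}}\|^2$.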

Consider (5.136) first. Since each $V_i$ passed the test, $\|u_{(i)}\|^2 \ge t(F_i, N_i)$. If $t(F_i, N_i)$ is chosen from (1.31), $t(F_i, N_i) \ge 2q_0\log(p) \ge 2(q_0/m)|F_i|\log(p)$; otherwise it is chosen from (1.36), and then $t(F_i, N_i) \ge 2q_0|F_i|\log(p)$. In both cases, there is a constant $q > 0$ such that

$\|u_{(i)}\|^2 \ge 2q|F_i|\log(p), \qquad 1 \le i \le s_0$.

In addition, it is easy to see that $\cup_{i=1}^{s_0} F_i$ is a partition of $\mathcal{I}$. It follows that

(5.138)  $\|u\|^2 = \sum_{i=1}^{s_0} \|u_{(i)}\|^2 \ge 2q|\mathcal{I}|\log(p)$.

At the same time, let $A_p$ be the event $\{\|d\|_\infty \le C_0\sqrt{\log(p)}\}$, where we argue that when $C_0$ is sufficiently large, $P(A_p^c) = o(1/p)$. To see this, recall that $d = B\beta + H^{1/2}\tilde z$, where $\tilde z \sim N(0, I_p)$. By the assumptions, $\|B\|_\infty \le C$, $\|\beta\|_\infty \le C\sqrt{\log(p)}$ and $\|H^{1/2}\|_\infty \le C$. Therefore, $\|d\|_\infty \le C(\sqrt{\log(p)} + \|\tilde z\|_\infty)$. It is well-known that $P(\|\tilde z\|_\infty > \sqrt{2a\log(p)}) = L_p p^{-a}$ for any $a > 0$. Hence, when $C_0$ is sufficiently large, $P(A_p^c) = o(1/p)$.

We shall show that over the event $A_p$, by choosing $\delta_0$ a sufficiently small constant,

(5.139)  $\|u - u^*\|^2 \le q|\mathcal{I}|\log(p)/2$.

Once this is proved, combining (5.138) and (5.139), and applying the inequality $(a + b)^2 \le 2(a^2 + b^2)$ for any $a, b \in \mathbb{R}$, we have

$2q|\mathcal{I}|\log(p) \le \|u\|^2 \le 2(\|u^*\|^2 + \|u - u^*\|^2) \le 2\|u^*\|^2 + q|\mathcal{I}|\log(p)$.

Hence, (5.136) holds with $C_6 = q/2$.

What remains is to prove (5.139). It follows from $G \in \mathcal{M}_p^*(\gamma, g, c_0, A_1)$ and $|V_i| \le m \le g$ that $\|(G^{V_i,V_i})^{-1}\| \le c_0^{-1}$. As a result, $\|Q_{(i)}^{-1}\| \le C$. Also, $\Sigma_{(i)}^{-1}$ is a submatrix of $Q_{(i)}^{-1}$, and hence $\|\Sigma_{(i)}^{-1}\| \le C$. This implies

(5.140)  $\|u_{(i)} - u^*_{(i)}\| \le C\|W_{(i)} - W^*_{(i)}\|, \qquad 1 \le i \le s_0$.

Since $B$ enjoys a polynomial off-diagonal decay with rate $\alpha$, $\|(B - \tilde B)^{V_i^{ps}, V_i}\|_\infty \le C\delta_0^{1 - 1/\alpha}$. Noting that $|V_i^{ps}| \le C$, this implies $\|(B - \tilde B)^{V_i^{ps}, V_i}\| \le C\delta_0^{1 - 1/\alpha}$. Similarly, we can derive $\|(H - \tilde H)^{V_i^{ps}, V_i^{ps}}\| \le C\delta_0^{1 - 1/\alpha}$. These together imply

(5.141)  $\|W_{(i)} - W^*_{(i)}\| \le C\delta_0^{1 - 1/\alpha}\|d^{V_i^{ps}}\| \le C\delta_0^{1 - 1/\alpha}\|d\|_\infty, \qquad 1 \le i \le s_0$,

where in the last inequality we use the facts that $|V_i^{ps}| \le C$ and $\|d^{V_i^{ps}}\|_\infty \le \|d\|_\infty$. Combining (5.140) and (5.141), over the event $A_p$,

$\|u_{(i)} - u^*_{(i)}\|^2 \le C\delta_0^{2(1 - 1/\alpha)}\log(p), \qquad 1 \le i \le s_0$.

Noting that $\alpha > 1$, we can choose a sufficiently small $\delta_0$ such that $C\delta_0^{2(1 - 1/\alpha)} \le q/2$, and (5.139) follows by noting $s_0 \le |\mathcal{I}|$.
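Next, consider (5.137). We write

$u^* = \Xi\Gamma\Theta \, d^{\mathcal{I}^{ps}}$,

where the matrices $\Xi$, $\Gamma$ and $\Theta$ are defined as follows: $\Xi$ is a block-wise diagonal matrix with the $i$-th block equal to $\Sigma_{(i)}^{-1/2}$. $\Gamma$ is a $|\mathcal{I}| \times (\sum_{i=1}^{s_0}|V_i|)$ matrix, with the $(F_i, V_i)$-block given by

$[\, I_{|F_i|}, \ -Q_{(i)}^{F_i,N_i}(Q_{(i)}^{N_i,N_i})^{-1} \,]$

(columns of $V_i$ ordered as $(F_i, N_i)$) and $0$ elsewhere. $\Theta$ is a $(\sum_{i=1}^{s_0}|V_i|) \times |\mathcal{I}^{ps}|$ matrix, with the $(V_i, V_i^{ps})$-block given by $(\tilde B^{V_i^{ps}, V_i})'(\tilde H^{V_i^{ps}, V_i^{ps}})^{-1}$ and $0$ elsewhere.

Note that these matrices are random (they depend on $\mathcal{U}_p^*$ and $\mathcal{I}$). Below, we show that for any realization of $\mathcal{U}_p^*$ and any component $\mathcal{I} \lhd \mathcal{U}_p^*$,

(5.142)  $\|\Xi\Gamma\Theta\| \le C$.

Once (5.142) is proved, (5.137) follows by letting $C_7 = C^2$.

We now show (5.142). Since $\|\Xi\Gamma\Theta\| \le \|\Xi\|\|\Gamma\|\|\Theta\|$, it suffices to show

$\|\Xi\|, \|\Gamma\|, \|\Theta\| \le C$.

First, $\|\Xi\| \le \max_i \|Q_{(i)}^{-1}\|^{1/2} \le C$. Second, the entries in $\Gamma$ and $\Theta$ have a uniform upper bound in magnitude, and each row and column of $\Gamma$ has at most $m$ non-zero entries. So $\|\Gamma\| \le C$. Finally, each row of $\Theta$ has no more than $2m\ell^{ps}$ entries; as a result, to show $\|\Theta\| \le C$, we only need to prove that each column of $\Theta$ also has a bounded number of non-zero entries.

Towards this end, write for short $\tilde B_{(i)} = \tilde B^{V_i^{ps}, V_i}$ and $\tilde H_{(i)} = \tilde H^{V_i^{ps}, V_i^{ps}}$ for each $1 \le i \le s_0$. By definition,

$\Theta(k, j) = \sum_{j' \in V_i^{ps}} \tilde B_{(i)}(j', k) \, \tilde H_{(i)}^{-1}(j', j), \qquad k \in V_i, \ j \in V_i^{ps}$.

First, given the chosen $\delta_0$, each row or column of $\tilde B$ and $\tilde H$ has at most $L_0$ non-zero entries, where $L_0$ is a constant integer. Therefore, for each $j'$, the number of $k$ such that $\tilde B(j', k) \ne 0$ is upper bounded by $L_0$. Second, we define a graph $\tilde{\mathcal{G}} = \tilde{\mathcal{G}}(\delta_0)$ where there is an edge between nodes $j$ and $j'$ if and only if $\tilde H(j, j') \ne 0$. For each $1 \le i \le s_0$, let $\tilde{\mathcal{G}}_i$ be the restriction of $\tilde{\mathcal{G}}$ to the nodes in $V_i^{ps}$. We see that $\tilde H_{(i)}$ is block-diagonal with each block corresponding to a component of $\tilde{\mathcal{G}}_i$, and so is $(\tilde H_{(i)})^{-1}$. This means $(\tilde H_{(i)})^{-1}(j', j)$ can be non-zero only when $j$ and $j'$ belong to the same component of $\tilde{\mathcal{G}}_i$. Since $|V_i^{ps}| \le 2m\ell^{ps}$ for all $i$, necessarily, there exists a path in $\tilde{\mathcal{G}}$ of length at most $2m\ell^{ps}$ that connects $j$ and $j'$. Third, since $\tilde{\mathcal{G}}$ is $L_0$-sparse, for each $j$, the number of $j'$ that are connected to $j$ by a path of length at most $2m\ell^{ps}$ is upper bounded by $L_0^{2m\ell^{ps}}$. In summary, for each fixed $j$, there are no more than $L_0 \cdot L_0^{2m\ell^{ps}}$ nodes $k$ such that $\Theta(k, j) \ne 0$, i.e., each column of $\Theta$ has at most $L_0^{2m\ell^{ps} + 1}$ nonzero entries and the claim follows. □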

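5.9. Proof of Lemma 2.3. Fix $\mathcal{I}$ and recall that $\mathcal{J} = \{j : D(i,j) \ne 0 \text{ for some } i \in \mathcal{I}\}$. In this lemma, $\mathcal{I}^{pe}$ is as in Definition 1.6, but $\mathcal{J}^{pe}$ is redefined as $\mathcal{J}^{pe} = \{j : D(i,j) \ne 0 \text{ for some } i \in \mathcal{I}^{pe}\}$. Denote $M = |\mathcal{J}^{pe}| - |\mathcal{I}^{pe}|$ and write $G^{\mathcal{J}^{pe}, \mathcal{J}^{pe}} = G_1$ for short. Let $\mathcal{F}$ be the mapping from $\mathcal{J}^{pe}$ to $\{1, \cdots, |\mathcal{J}^{pe}|\}$ that maps each $j \in \mathcal{J}^{pe}$ to its order in $\mathcal{J}^{pe}$. Denote $\mathcal{I}_1 = \mathcal{F}(\mathcal{I})$. By these notations, the claim reduces to: for any $|\mathcal{J}^{pe}| \times M$ matrix $U$ whose columns form an orthonormal basis of $Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$,

$\|[U(U'G_1^{-1}U)^{-1}U']^{\mathcal{I}_1, \mathcal{I}_1}\| = o(1)$.

It suffices to show

(5.143)  $\|[U(U'G_1^{-1}U)^{-1}U']^{\mathcal{I}_1, \mathcal{I}_1}\| \le C(\ell^{pe})^{-\gamma}$,

where $\gamma > 0$ is the same as in $\mathcal{M}_p^*(\gamma, g, c_0, A_1)$. In fact, once this is proved, the claim follows by noting that $\ell^{pe} = (\log(p))^{\nu} \to \infty$.

We now show (5.143). By elementary algebra,

(5.144)  $\|[U(U'G_1^{-1}U)^{-1}U']^{\mathcal{I}_1, \mathcal{I}_1}\| \le \|(U'G_1^{-1}U)^{-1}\| \, \|(UU')^{\mathcal{I}_1, \mathcal{I}_1}\|$.

Consider $\|(U'G_1^{-1}U)^{-1}\|$ first. Since $U'U$ is an identity matrix, we have $\|(U'G_1^{-1}U)^{-1}\| = [\lambda_{\min}(U'G_1^{-1}U)]^{-1} \le [\lambda_{\min}(G_1^{-1})]^{-1} = \|G_1\|$. Additionally, the assumption $G \in \mathcal{M}_p^*(\gamma, g, c_0, A_1)$ implies that $\|G_1\| \le A_1\sum_{j=1}^{|\mathcal{J}^{pe}|} j^{-\gamma} \le C|\mathcal{J}^{pe}|^{1-\gamma}$. Last, when $|\mathcal{I}| \le l_0$, $2\ell^{pe} + 1 \le |\mathcal{J}^{pe}| \le (2\ell^{pe} + 1)l_0$. Combining the above yields

(5.145)  $\|(U'G_1^{-1}U)^{-1}\| \le C(\ell^{pe})^{1-\gamma}$.

Next, consider $\|(UU')^{\mathcal{I}_1, \mathcal{I}_1}\|$. Note that $\|(UU')^{\mathcal{I}_1, \mathcal{I}_1}\| \le |\mathcal{I}_1| \cdot \max_{i,i' \in \mathcal{I}_1} |(UU')(i,i')|$, where $\max_{i,i' \in \mathcal{I}_1} |(UU')(i,i')| \le M \cdot \max_{i \in \mathcal{I}_1, 1 \le j \le M} |U(i,j)|^2$. Here $|\mathcal{I}_1| = |\mathcal{I}| \le l_0$ and $M \le h|\mathcal{I}| \le hl_0$. It follows that

(5.146)  $\|(UU')^{\mathcal{I}_1, \mathcal{I}_1}\| \le C \cdot \max_{i \in \mathcal{I}_1, 1 \le j \le M} |U(i,j)|^2$.

The following lemma is proved in Appendix A.

Lemma 5.11. Under the conditions of Lemma 2.3, for any $\mathcal{I} \lhd \mathcal{G}^+$ such that $|\mathcal{I}| \le l_0$, and any matrix $U$ whose columns form an orthonormal basis of $Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$,

$\max_{i \in \mathcal{F}(\mathcal{I}), \ 1 \le j \le |\mathcal{J}^{pe}| - |\mathcal{I}^{pe}|} |U(i,j)|^2 \le C(\ell^{pe})^{-1}$.

Using Lemma 5.11, it follows from (5.146) that

(5.147)  $\|(UU')^{\mathcal{I}_1, \mathcal{I}_1}\| \le C(\ell^{pe})^{-1}$.

Inserting (5.145) and (5.147) into (5.144), we obtain (5.143). □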

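5.10. Proof of Lemma 2.4. Write for short

$M_1 = \sum_{j=1}^{p} \sum_{\mathcal{I}: \, j \in \mathcal{I} \lhd \mathcal{G}^+, \, |\mathcal{I}| \le l_0} P(\mathcal{I} \lhd \mathcal{U}_p^*, \ A_p \cap E_{p,\mathcal{I}}^c), \qquad M_2 = \sum_{k=1}^{p} P(\beta_k \ne 0, \ k \notin \mathcal{U}_p^*)$.

With these notations, the claim reduces to $M_1 \le L_p \cdot M_2$.

The key is to prove

(a) for each $\mathcal{I} \lhd \mathcal{G}^+$, over the event $\{\mathcal{I} \lhd \mathcal{U}_p^*, \ A_p \cap E_{p,\mathcal{I}}^c\}$, it always holds that $(S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})) \setminus \mathcal{U}_p^* \ne \emptyset$;

(b) for each $k$, there are no more than $L_p$ different $\mathcal{I}$ such that $\mathcal{I} \lhd \mathcal{G}^+$, $|\mathcal{I}| \le l_0$ and $k \in \mathcal{E}(\mathcal{I}^{pe})$.

Once (a) and (b) are proved, the claim follows easily. To see the point, we note that

$P((S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})) \setminus \mathcal{U}_p^* \ne \emptyset) \le \sum_{k \in \mathcal{E}(\mathcal{I}^{pe})} P(\beta_k \ne 0, \ k \notin \mathcal{U}_p^*)$.

Combining this with (a), we have

$M_1 \le \sum_{j=1}^{p} \sum_{\mathcal{I}: \, j \in \mathcal{I} \lhd \mathcal{G}^+, \, |\mathcal{I}| \le l_0} \ \sum_{k \in \mathcal{E}(\mathcal{I}^{pe})} P(\beta_k \ne 0, \ k \notin \mathcal{U}_p^*)$.

By re-organizing the summation, the right hand side is equal to

$\sum_{k=1}^{p} \ \sum_{\mathcal{I}: \, \mathcal{I} \lhd \mathcal{G}^+, \, |\mathcal{I}| \le l_0, \, k \in \mathcal{E}(\mathcal{I}^{pe})} |\mathcal{I}| \cdot P(\beta_k \ne 0, \ k \notin \mathcal{U}_p^*)$,

which is at most $L_p \cdot M_2$ by (b), and the claim follows.

We now show (a) and (b). Consider (a) first. Fix $\mathcal{I} \lhd \mathcal{G}^+$. Suppose (a) does not hold, i.e., the following event

$\{\mathcal{I} \lhd \mathcal{U}_p^*, \ (S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})) \setminus \mathcal{U}_p^* = \emptyset, \ A_p \cap E_{p,\mathcal{I}}^c\}$

is non-empty. View $\mathcal{U}_p^*$ as a subgraph of $\mathcal{G}^+$. Applying Lemma 5.2 to $V = \mathcal{U}_p^*$, we find that $\mathcal{I} \lhd \mathcal{U}_p^*$ implies $(\mathcal{U}_p^* \setminus \mathcal{I}) \cap \mathcal{E}(\mathcal{I}^{pe}) = \emptyset$. Therefore, the following event

(5.148)  $\{(\mathcal{U}_p^* \setminus \mathcal{I}) \cap \mathcal{E}(\mathcal{I}^{pe}) = \emptyset, \ (S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})) \setminus \mathcal{U}_p^* = \emptyset, \ A_p \cap E_{p,\mathcal{I}}^c\}$

is non-empty. Note that $\mathcal{I} \subset \mathcal{E}(\mathcal{I}^{pe})$. From basic set operations, $(\mathcal{U}_p^* \setminus \mathcal{I}) \cap \mathcal{E}(\mathcal{I}^{pe}) = \emptyset$ and $(S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})) \setminus \mathcal{U}_p^* = \emptyset$ together imply

$(S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})) \subset \mathcal{I}$.

By definition, this belongs to the event $E_{p,\mathcal{I}}$. Hence, the event in (5.148) is empty, which is a contradiction.

Consider (b) next. Fix $k$ and denote by $\mathcal{K}$ the collection of $\mathcal{I}$ satisfying the conditions in (b). Let $V = \{1 \le i \le p : k \in \mathcal{E}(\{i\}^{pe})\}$. Since $\mathcal{E}(\mathcal{I}^{pe}) = \cup_{i \in \mathcal{I}} \mathcal{E}(\{i\}^{pe})$, we observe that

$\mathcal{K} = \cup_{i \in V} \mathcal{K}_i$, where $\mathcal{K}_i = \{\mathcal{I} : \mathcal{I} \lhd \mathcal{G}^+, \ |\mathcal{I}| \le l_0, \ i \in \mathcal{I}\}$.

Note that by Lemma 5.1 and 5.2, $\mathcal{G}^*$ is $K_p$-sparse and $\mathcal{G}^+$ is $K_p^+$-sparse, where both $K_p$ and $K_p^+$ are $L_p$ terms. First, we bound $|V|$: By definition, $k \in \mathcal{E}(\{i\}^{pe})$ if and only if there exists a node $k' \in \{i\}^{pe}$ such that $k'$ and $k$ are connected by a length-1 path in $\mathcal{G}^*$. Since $\mathcal{G}^*$ is $K_p$-sparse, given $k$, the number of such $k'$ is bounded by $K_p$. In addition, for each $k'$, there are no more than $(2\ell^{pe} + 1)$ nodes $i$ such that $k' \in \{i\}^{pe}$. Hence, $|V| \le (2\ell^{pe} + 1)K_p$. Second, we bound $\max_{i \in V} |\mathcal{K}_i|$. For each node $i \in V$, there are no more than $C(eK_p^+)^{l_0}$ connected subgraphs of $\mathcal{G}^+$ that contain $i$ and have a size at most $l_0$ [14], i.e., $|\mathcal{K}_i| \le C(eK_p^+)^{l_0}$. Combining the two parts, $|\mathcal{K}| \le K_p(2\ell^{pe} + 1) \cdot C(eK_p^+)^{l_0}$, which is an $L_p$ term. □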

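5.11. Proof of Lemma 2.5. Let $V_1 = S(\beta) \cap \mathcal{E}(\mathcal{I}^{pe})$ and $V_2 = S(\beta) \setminus \mathcal{E}(\mathcal{I}^{pe})$. We have $(B\beta)^{\mathcal{I}^{pe}} = B^{\mathcal{I}^{pe}, V_1}\beta^{V_1} + \zeta$, where $\zeta = B^{\mathcal{I}^{pe}, V_2}\beta^{V_2}$. Note that over the event $E_{p,\mathcal{I}}$, $V_1 \subset \mathcal{I}$. It follows that $B^{\mathcal{I}^{pe}, V_1}\beta^{V_1} = B^{\mathcal{I}^{pe}, \mathcal{I}}\beta^{\mathcal{I}}$. Combining these, to show the claim, it is sufficient to show

(5.149)  $\|\zeta\| \le C(\ell^{pe})^{1/2}[\log(p)]^{-(1 - 1/\alpha)}\tau_p$.

Recall the matrix $B^{**}$ defined in Section 5.1. Since $B^{**}(i,j) = 0$ for $i \in \mathcal{I}^{pe}$ and $j \in V_2$, we have $\|B^{\mathcal{I}^{pe}, V_2}\|_\infty \le \|B - B^{**}\|_\infty$, where by Lemma 5.1, $\|B - B^{**}\|_\infty \le C[\log(p)]^{-(1 - 1/\alpha)}$. Moreover, $\|\beta\|_\infty \le a\tau_p$. Consequently,

(5.150)  $\|\zeta\|_\infty \le \|B - B^{**}\|_\infty\|\beta\|_\infty \le C[\log(p)]^{-(1 - 1/\alpha)}\tau_p$.

At the same time, note that $|\mathcal{I}^{pe}| \le l_0(2\ell^{pe} + 1) \le C\ell^{pe}$. It follows from the Cauchy-Schwarz inequality that $\|\zeta\| \le \sqrt{|\mathcal{I}^{pe}|} \, \|\zeta\|_\infty \le C(\ell^{pe})^{1/2}\|\zeta\|_\infty$. Combining this with (5.150) gives the claim. □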



5.12. Proof of Lemma 2.6. Fix $(j, V_0, V_1, \mathcal{I})$ and write for short $\rho_j(V_0, V_1) = \rho_j(V_0, V_1; \mathcal{I})$ and $\rho_j^* = \rho_j^*(\vartheta, r, G)$. The goal is to show $\rho_j(V_0, V_1) \ge \rho_j^* + o(1)$. We show this for the case $V_0 \ne V_1$ and the case $V_0 = V_1$ separately.

Consider the first case. By definition, $\rho_j^* \le \rho(V_0, V_1)$, where $\rho(V_0, V_1)$ is as in (1.26). Therefore, it suffices to show

(5.151)  $\rho_j(V_0, V_1) \ge \rho(V_0, V_1) + o(1)$.

Introduce the function

$f(x) = \max\{|V_0|, |V_1|\}\vartheta + \frac{1}{4}\left[\left(\sqrt{x} - \big||V_0| - |V_1|\big|\vartheta/\sqrt{x}\right)_+\right]^2, \qquad x > 0$.

Then $\rho_j(V_0, V_1) = f(\varpi_j r)$ and $\rho(V_0, V_1) = f(\varpi^* r)$, where $\varpi_j = \varpi_j(V_0, V_1; \mathcal{I})$ and $\varpi^* = \varpi^*(V_0, V_1)$ are defined in (2.59) and (1.25), respectively. Since $f(x)$ is an increasing function and $|f(x) - f(y)| \le |x - y|/4$ for all $x, y > 0$, to show (5.151), it suffices to show

(5.152)  $\varpi_j \ge \varpi^* + o(1)$.

Now, we show (5.152). Introduce the quantity $\varpi = \min_{j \in (V_0 \cup V_1)} \varpi_j$. Write $B_1 = B^{\mathcal{I}^{pe}, \mathcal{I}}$, $H_1 = H^{\mathcal{I}^{pe}, \mathcal{I}^{pe}}$ and $Q_1 = B_1'H_1^{-1}B_1$. Given any $C > 0$, define $\Theta(C)$ as the collection of vectors $\xi \in \mathbb{R}^{|\mathcal{I}|}$ such that for all $i$, either $\xi_i = 0$ or $|\xi_i| \ge 1$, and that $Supp(\xi^{(k)}) = V_k$, $\|\xi^{(k)}\|_\infty \le C$, for $k = 0, 1$. Denote $\Theta = \Theta(\infty)$. By these notations and the definitions of $\varpi_j$ and $\varpi^*$, we have

$\varpi = \min_{(\xi^{(0)}, \xi^{(1)}): \ \xi^{(k)} \in \Theta, \, k = 0, 1; \ sgn(\xi^{(0)}) \ne sgn(\xi^{(1)})} (\xi^{(1)} - \xi^{(0)})'Q_1(\xi^{(1)} - \xi^{(0)})$,

$\varpi^* = \min_{(\xi^{(0)}, \xi^{(1)}): \ \xi^{(k)} \in \Theta(a), \, k = 0, 1; \ sgn(\xi^{(0)}) \ne sgn(\xi^{(1)})} (\xi^{(1)} - \xi^{(0)})'G^{\mathcal{I},\mathcal{I}}(\xi^{(1)} - \xi^{(0)})$.

First, since $a > a^*(G)$, in the expression of $\varpi^*$, $\Theta(a)$ can be replaced by $\Theta(C)$ for any $C \ge a$. Second, since $\lambda_{\min}(Q_1) \ge C$, from basic properties of quadratic programming, there exists a constant $a_0 > 0$ such that for any $(\xi_*^{(0)}, \xi_*^{(1)})$, a minimizer in the expression of $\varpi$, $\max\{\|\xi_*^{(0)}\|_\infty, \|\xi_*^{(1)}\|_\infty\} \le a_0$. Therefore, in the expression of $\varpi$, $\Theta$ can be replaced by $\Theta(C)$ for any $C \ge a_0$. Now, let $a_1 = \max\{a_0, a\}$, so that the constraints in the two expressions can be unified to $\xi^{(k)} \in \Theta(a_1)$, for $k = 0, 1$, and $sgn(\xi^{(0)}) \ne sgn(\xi^{(1)})$. It follows that

(5.153)  $|\varpi - \varpi^*| \le \max_{\xi \in \mathbb{R}^{|\mathcal{I}|}: \ \|\xi\|_\infty \le 2a_1} |\xi'(G^{\mathcal{I},\mathcal{I}} - Q_1)\xi| \le C\|G^{\mathcal{I},\mathcal{I}} - Q_1\|$.

Note that $Q_1$ is the Fisher Information Matrix associated with the model $d_1 \sim N(B_1\beta^{\mathcal{I}}, H_1)$; by Lemma 1.2 and Lemma 2.3, $\|G^{\mathcal{I},\mathcal{I}} - Q_1\| = o(1)$. Plugging this into (5.153) gives $|\varpi - \varpi^*| = o(1)$. Hence, $\varpi_j \ge \varpi \ge \varpi^* + o(1)$ and (5.152) follows.

Next, consider the case $V_0 = V_1$. Pick an arbitrary minimizer in the definition of $\varpi_j$, denoted as $(\xi_*^{(0)}, \xi_*^{(1)})$, and define $F = \{k : sgn(\xi_{*k}^{(0)}) \ne sgn(\xi_{*k}^{(1)})\}$ and $N = V_0 \setminus F$. It is seen that $j \in F$. By Lemma 5.3, $\rho_j^* \le \psi(F, N)$, where $\psi(F, N)$ is defined in (1.33). Hence, it suffices to show

(5.154)  $\rho_j(V_0, V_1) \ge \psi(F, N) + o(1)$.

On one hand, when $|V_0| = |V_1|$, the function $f$ introduced above is equal to $|V_0|\vartheta + x/4$ and hence

$\rho_j(V_0, V_1) = f(\varpi_j r) = |V_0|\vartheta + \varpi_j r/4$.

On the other hand, using the expression of $\psi(F, N)$ in (1.33) and noting that $|F| \ge 1$,

$\psi(F, N) \le (|F| + |N|)\vartheta + \omega r/4 = |V_0|\vartheta + \omega r/4$,

where $\omega = \omega(F, N)$ is defined in (1.34). Therefore, to show (5.154), it suffices to show

(5.155)  $\varpi_j \ge \omega + o(1)$.

Now, we show (5.155). From the definition (1.34) and basic algebra, we can write

$\omega = \min_{\xi \in \mathbb{R}^{|\mathcal{I}|}: \ \xi_i = 0, \, i \notin V_0; \ |\xi_i| \ge 1, \, i \in F} \xi'G^{\mathcal{I},\mathcal{I}}\xi$.

Denote $\xi_* = \xi_*^{(1)} - \xi_*^{(0)}$. By our construction, $\varpi_j = \xi_*'Q_1\xi_*$; $\xi_{*i} = 0$ for $i \notin V_0$, and $|\xi_{*i}| \ge 2$ for $i \in F$. As a result,

(5.156)  $\xi_*'G^{\mathcal{I},\mathcal{I}}\xi_* \ge \omega$.

At the same time, we have seen in the derivation of (5.153) that there exists a constant $a_0 > 0$ such that $\|\xi_*^{(0)}\|_\infty, \|\xi_*^{(1)}\|_\infty \le a_0$ and $\|G^{\mathcal{I},\mathcal{I}} - Q_1\| = o(1)$. Therefore, $\|\xi_*\|^2 \le (2a_0)^2|\mathcal{I}| \le C$ and

(5.157)  $|\varpi_j - \xi_*'G^{\mathcal{I},\mathcal{I}}\xi_*| = |\xi_*'Q_1\xi_* - \xi_*'G^{\mathcal{I},\mathcal{I}}\xi_*| \le C\|G^{\mathcal{I},\mathcal{I}} - Q_1\| = o(1)$.

Combining (5.156) and (5.157) gives (5.155). □

APPENDIX A: SUPPLEMENTARY PROOFS

In this section, we prove Lemma 5.5, 5.7, 5.8 and 5.11.

A.1. Proof of Lemma 5.5. Write $\bar\kappa_p = \max_{\log(p) \le j \le p - \log(p)} G^{-1}(j,j)$ and $a_0 = \frac{1}{2\pi}\int_{-\pi}^{\pi} f_*^{-1}(\omega)\,d\omega$. The assertion of Lemma 5.5 is

$\lim_{p \to \infty} \bar\kappa_p = a_0$.

To show this, denote $\underline\kappa_p = \min_{\log(p) \le j \le p - \log(p)} G^{-1}(j,j)$, and $\tilde\kappa_p = trace(G^{-1})/p$. Since $\log(p) \ll p$ and all diagonals of $G^{-1}$ are bounded from above, it follows from definitions that

(A.158)  $\underline\kappa_p + o(1) \le \tilde\kappa_p \le \bar\kappa_p + o(1)$.

At the same time, the conditions of Lemma 1.5 ensure that $f_*(\omega)$ is continuously differentiable on $[-\pi, \pi]$. By [26],

$\lim_{p \to \infty} \tilde\kappa_p = a_0$.

Therefore, $\liminf_{p \to \infty} \bar\kappa_p \ge \lim_{p \to \infty} \tilde\kappa_p = a_0$, and all we need to show is $\limsup_{p \to \infty} \bar\kappa_p \le a_0$.

Towards this end, write $G = G_p$ to emphasize its dependence on $p$. For any positive definite $p \times p$ matrix $A$ and a subset $V \subset \{1, \cdots, p\}$, if we let $B_1$ be the inverse of $A^{V,V}$ and $B_2$ the $(V,V)$-block of $A^{-1}$, then by elementary algebra, $B_2 - B_1$ is positive semi-definite. Now, for any $(i, j)$ such that $\log(p) \le j \le p - \log(p)$ and $1 \le i \le \lfloor\log(p)\rfloor$, let $V = \{j - i + 1, \cdots, j - i + \lfloor\log(p)\rfloor\}$ ($\lfloor x \rfloor$ denotes the largest integer $k$ such that $k \le x$). Applying the above argument to the set $V$ and matrix $A = G_p$, we have $[(G_p^{V,V})^{-1}](i,i) \le G_p^{-1}(j,j)$. At the same time, the Toeplitz structure yields $G_p^{V,V} = G_{\lfloor\log(p)\rfloor}$. As a result, $G_{\lfloor\log(p)\rfloor}^{-1}(i,i) \le G_p^{-1}(j,j)$. Since this holds for all $i$ and $j$, we have

$\bar\kappa_{\lfloor\log(p)\rfloor} \le \underline\kappa_p$.

Combining this with the first inequality of (A.158), $\bar\kappa_{\lfloor\log(p)\rfloor} \le \tilde\kappa_p + o(1)$. It follows that $\limsup_{p \to \infty} \bar\kappa_p \le \lim_{p \to \infty} \tilde\kappa_p$ and the claim follows.
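The matrix fact used in this argument (for a positive definite $A$, the $(V,V)$-block of $A^{-1}$ dominates the inverse of $A^{V,V}$) can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: the Toeplitz test matrix with entries $2^{-|i-j|}$, the index window `V`, and the helper names `invert` and `submatrix` are all hypothetical and not taken from the paper. It uses exact rational arithmetic and compares diagonal entries, which is the form in which the fact is applied to $G_p$.

```python
from fractions import Fraction


def invert(matrix):
    """Gauss-Jordan inverse in exact rational arithmetic (small matrices only)."""
    n = len(matrix)
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def submatrix(matrix, rows, cols):
    """Extract the block of `matrix` indexed by `rows` and `cols`."""
    return [[matrix[i][j] for j in cols] for i in rows]


# Hypothetical positive definite Toeplitz matrix A(i, j) = 2^(-|i - j|).
p = 8
A = [[Fraction(1, 2 ** abs(i - j)) for j in range(p)] for i in range(p)]
V = [2, 3, 4]  # a window of consecutive (0-based) indices

B1 = invert(submatrix(A, V, V))  # inverse of the (V, V)-block of A
B2 = submatrix(invert(A), V, V)  # (V, V)-block of the inverse of A

# B2 - B1 is positive semi-definite, so its diagonal entries are non-negative.
diagonal_gaps = [B2[i][i] - B1[i][i] for i in range(len(V))]
```

Exact fractions are used so that the comparison is not affected by rounding; for larger matrices a numerical linear algebra library would be the more practical choice.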

We remark that additionally $\lim_{p \to \infty} \min_{\log(p) \le j \le p - \log(p)} G^{-1}(j,j) = a_0$, whose proof is similar and is omitted. □

A.2. Proof of Lemma 5.7. Fix $\log(p) \le j \le p - \log(p)$. Denote the collection of pairs of sets

$\mathcal{C}_j = \{(F, N) : \min(F \cup N) = j, \ F \cap N = \emptyset, \ F \ne \emptyset\}$,

and its sub-collection

$\mathcal{C}_j^* = \{(F, N) \in \mathcal{C}_j : F \in \mathcal{R}_p, \ (F \cup N) \in \mathcal{R}_p, \ |F| \le 3 \text{ and } |N| \le 2\}$,

where we recall that $\mathcal{R}_p$ is the collection of sets that are formed by consecutive nodes. The claim now reduces to

$\min_{(F,N) \in \mathcal{C}_j^*} \psi(F, N) = \min_{(F,N) \in \mathcal{C}_j} \psi(F, N) + o(1)$.

Noting that $\mathcal{C}_j^* \subset \mathcal{C}_j$, it suffices to show that for any $(F, N) \in \mathcal{C}_j$, there exists $(F', N')$ such that

(A.159)  $\psi(F', N') \le \psi(F, N) + o(1)$ and $(F', N') \in \mathcal{C}_j^*$.

To show (A.159), we introduce the notation $(F', N') \preceq (F, N)$ to indicate $\psi(F', N') \le \psi(F, N)$, $|F'| \le |F|$, and $|N'| \le |N|$.

Using these notations, we claim:

(a) For any $(F, N) \in \mathcal{C}_j$, there exists $(F', N') \in \mathcal{C}_j$ such that $\psi(F', N') \le \psi(F, N) + o(1)$ and $|F'| \le 3$.

(b) For any $(F, N) \in \mathcal{C}_j$, there exists $(F', N') \in \mathcal{C}_j$ such that $(F', N') \preceq (F, N)$ and $(F' \cup N') \in \mathcal{R}_p$.

(c) For any $(F, N) \in \mathcal{C}_j$ satisfying $(F \cup N) \in \mathcal{R}_p$, there exists $(F', N') \in \mathcal{C}_j$ such that $(F', N') \preceq (F, N)$, $(F' \cup N') \in \mathcal{R}_p$ and $F' \in \mathcal{R}_p$.

(d) For any $(F, N) \in \mathcal{C}_j$ satisfying $(F \cup N) \in \mathcal{R}_p$ and $F \in \mathcal{R}_p$, there exists $(F', N') \in \mathcal{C}_j$ such that $(F', N') \preceq (F, N)$, $(F' \cup N') \in \mathcal{R}_p$, $F' \in \mathcal{R}_p$ and $|N'| \le 2$.

Now, for any $(F, N) \in \mathcal{C}_j$, we construct $(F', N')$ as follows: First, by (a), there exists $(F_1, N_1)$ such that $\psi(F_1, N_1) \le \psi(F, N) + o(1)$, and $|F_1| \le 3$. Second, by (b) and (c), there exists $(F_2, N_2)$ such that $(F_2, N_2) \preceq (F_1, N_1)$, $F_2 \in \mathcal{R}_p$ and $(F_2 \cup N_2) \in \mathcal{R}_p$. Finally, by (d), there exists $(F_3, N_3)$ such that $(F_3, N_3) \preceq (F_2, N_2)$, $(F_3 \cup N_3) \in \mathcal{R}_p$, $F_3 \in \mathcal{R}_p$ and $|N_3| \le 2$. Let $(F', N') = (F_3, N_3)$.

By the construction, $(F' \cup N') \in \mathcal{R}_p$, $F' \in \mathcal{R}_p$ and

$\psi(F', N') = \psi(F_3, N_3) \le \psi(F_2, N_2) \le \psi(F_1, N_1) \le \psi(F, N) + o(1)$.

Moreover, $|F'| = |F_3| \le |F_2| \le |F_1| \le 3$, and $|N'| = |N_3| \le 2$. So $(F', N')$ satisfies (A.159).

All that remains is to verify the claims (a)-(d). We need the following results, which follow from basic algebra and whose proofs we omit: First, recall the definition of $\omega(F, N)$ in (1.34). For any fixed $(F, N)$, let $\mathcal{I} = F \cup N$ and $R = (G^{\mathcal{I},\mathcal{I}})^{-1}$. Then

(A.160)  $\omega(F, N) = \min_{\xi \in \mathbb{R}^{|F|}: \ |\xi_i| \ge 1} \xi'(R^{F,F})^{-1}\xi$.

Second, when $(F \cup N) \in \mathcal{R}_p$,

(A.161)  $R = \frac{1}{j}\eta\eta' + \Sigma_*^{(k)}, \qquad k = |F \cup N|$,

where $\eta = (1, 0, \cdots, 0)'$ and $\Sigma_*$ is as in (5.110).

Now, we show (a). The case $|F| \le 3$ is trivial, so without loss of generality we assume $|F| \ge 4$. Take

$F' = \{j + 1, j + 2\}, \qquad N' = \{j\}$.

We check that $(F', N')$ satisfies the requirement in (a). It is obvious that $(F', N') \in \mathcal{C}_j$ and $|F'| \le 3$. We only need to check $\psi(F', N') \le \psi(F, N) + o(1)$. On one hand, direct calculations yield $\omega(F', N') = (j + 1)/(j + 2) = 1 + o(1)$, and

$\psi(F', N') \le 2\vartheta + r/4 + o(1)$.

On the other hand, by (A.160), $\omega(F, N) \ge |F| \cdot [\lambda_{\max}(R^{F,F})]^{-1} \ge |F| \cdot \lambda_{\min}(G)$. Noting that $G^{-1} = H$, we have $\|G^{-1}\| \le \|H\|_\infty \le 4$. So $\lambda_{\min}(G) \ge 1/4$. Therefore, $\omega(F, N) \ge 1$. It follows that

$\psi(F, N) \ge |F|\vartheta/2 + \omega(F, N)r/4 \ge 2\vartheta + r/4$.

Combining the two parts, we have $\psi(F', N') \le \psi(F, N) + o(1)$.

Next, we verify (b). We construct $(F', N')$ by constructing a sequence of $(F^{(t)}, N^{(t)})$ recursively: Initially, set $F^{(1)} = F$ and $N^{(1)} = N$. On round $t$, write $F^{(t)} \cup N^{(t)} = \{j_1, \cdots, j_k\}$, where the nodes are arranged in ascending order and $k = |F^{(t)} \cup N^{(t)}|$. Let $l_0$ be the largest index such that $j_l = j_1 + l - 1$ for all $l \le l_0$. If $l_0 = k$, then the process terminates. Otherwise, let $L = j_{l_0 + 1} - j_1 - l_0$ and update

$F^{(t+1)} = \{j_l - L \cdot 1\{l > l_0\} : j_l \in F^{(t)}\}, \qquad N^{(t+1)} = \{j_l - L \cdot 1\{l > l_0\} : j_l \in N^{(t)}\}$.

By the construction, it is not hard to see that $l_0$ strictly increases as $t$ increases, and $k$ remains unchanged. So the process terminates in finitely many rounds. Let $T$ be the number of rounds when the process terminates; we construct $(F', N')$ by

$F' = F^{(T)}, \qquad N' = N^{(T)}$.

Now, we justify that $(F', N')$ satisfies the requirement in (b). First, it is seen that $\min(F^{(t)} \cup N^{(t)}) = j$ on every round $t$. So $\min(F' \cup N') = j$ and $(F', N') \in \mathcal{C}_j$. Second, on round $T$, $l_0 = k$, which implies $(F' \cup N') \in \mathcal{R}_p$. Third, $|F^{(t)}|$ and $|N^{(t)}|$ remain unchanged as $t$ increases, so $|F'| = |F|$ and $|N'| = |N|$. Finally, it remains to check $\psi(F', N') \le \psi(F, N)$. It suffices to show

(A.162)  $\psi(F^{(t+1)}, N^{(t+1)}) \le \psi(F^{(t)}, N^{(t)})$, for $t = 1, \cdots, T - 1$.

Let $\mathcal{I} = F^{(t)} \cup N^{(t)}$ and $\mathcal{I}_1 = F^{(t+1)} \cup N^{(t+1)}$. We observe that $G^{\mathcal{I}_1,\mathcal{I}_1} = G^{\mathcal{I},\mathcal{I}} - L\eta\eta'$, where $\eta = (0_{l_0}', 1_{k - l_0}')'$. So $G^{\mathcal{I},\mathcal{I}} - G^{\mathcal{I}_1,\mathcal{I}_1}$ is positive semi-definite. It follows from (A.160) that $\omega(F^{(t+1)}, N^{(t+1)}) \le \omega(F^{(t)}, N^{(t)})$, and hence (A.162) holds by recalling that $|F^{(t+1)}| = |F^{(t)}|$ and $|N^{(t+1)}| = |N^{(t)}|$.
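The recursive construction used for (b) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names `compact_once` and `compact` and the example sets are hypothetical, and only the set manipulation is reproduced (the quantities $\psi$ and $\omega$ are not computed). Each round finds the first gap in $F \cup N$ and shifts every node after it to the left by the size of the gap.

```python
def compact_once(F, N):
    """One round: shift every node after the first gap left by the gap size.

    Returns the updated (F, N) and a flag telling whether anything changed.
    """
    nodes = sorted(F | N)
    k = len(nodes)
    # l0 = length of the initial run of consecutive nodes j_1, j_1 + 1, ...
    l0 = 1
    while l0 < k and nodes[l0] == nodes[0] + l0:
        l0 += 1
    if l0 == k:
        return F, N, False
    gap = nodes[l0] - nodes[0] - l0  # L = j_{l0+1} - j_1 - l0
    cutoff = nodes[l0 - 1]           # last node of the initial run

    def shift(v):
        return v - gap if v > cutoff else v

    return {shift(v) for v in F}, {shift(v) for v in N}, True


def compact(F, N):
    """Repeat compact_once until the union of F and N is a run of consecutive nodes."""
    changed = True
    while changed:
        F, N, changed = compact_once(F, N)
    return F, N


# Hypothetical example with j = 3.
F_new, N_new = compact({3, 7, 8}, {5, 12})
```

In this example the minimum node and the sizes of both sets are preserved, and the union becomes a run of consecutive nodes, matching the first three properties checked in the proof.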

Third, we prove (c). By assumptions, $(F \cup N) \in \mathcal{R}_p$, so that we can write $F \cup N = \{j, j + 1, \cdots, j + k\}$, where $k + 1 = |F \cup N|$. The case $F \in \mathcal{R}_p$ is trivial. In the case $F \notin \mathcal{R}_p$, we construct $(F', N')$ as follows: Let $i_0$ be the smallest index such that $i_0 \notin F$ and both $F_1 = F \cap \{i : i < i_0\}$ and $F_2 = F \setminus F_1$ are not empty. We note that such $i_0$ exists because $F \notin \mathcal{R}_p$. Let

$F' = F_1 = \{i \in F : i < i_0\}, \qquad N' = \{i \in N : i \le i_0\}$.

To check that $(F', N')$ satisfies the requirement in (c), first note that $\min(F' \cup N') = j$ and hence $(F', N') \in \mathcal{C}_j$. Second, it is easy to see that $|F'| \le |F|$ and $|N'| \le |N|$. Third, from the definition of $i_0$, $F' \in \mathcal{R}_p$. Additionally, since $i_0 \in N$, we have $F' \cup N' = \{j, j + 1, \cdots, i_0\} \in \mathcal{R}_p$. Last, we check $\psi(F', N') \le \psi(F, N)$: Since $|F'| \le |F|$ and $|N'| \le |N|$, it suffices to show

(A.163)  $\omega(F', N') \le \omega(F, N)$.

Write $\mathcal{I} = F \cup N$ and denote $R = (G^{\mathcal{I},\mathcal{I}})^{-1}$. From (A.161), $R$ is tri-diagonal. So $R^{F,F}$ is block-diagonal in the partition $F = F_1 \cup F_2$. Using (A.160), it is easy to see

$\omega(F_1, \mathcal{I} \setminus F_1) \le \omega(F, \mathcal{I} \setminus F) = \omega(F, N)$.

At the same time, notice that both $\mathcal{I}$ and $\mathcal{I}' = F' \cup N'$ have the form $\{j, j + 1, \cdots, m\}$ with $m \ge \max(F_1) + 1$. Applying (A.160) and (A.161), by direct calculations,

$\omega(F_1, \mathcal{I} \setminus F_1) = \omega(F_1, \mathcal{I}' \setminus F_1) = \omega(F', N')$.

Combining the two parts gives (A.163).

Finally, we justify (d). By assumptions, $(F \cup N) \in \mathcal{R}_p$ and $F \in \mathcal{R}_p$, so that we write $F \cup N = \{j, j + 1, \cdots, k\}$, and $F = \{j_0, j_0 + 1, \cdots, k_0\}$, where $j_0 \ge j$ and $k_0 \le k$. The case $|N| \le 2$ is trivial. In the case $|N| > 2$, let $m_0 = |F|$ and we construct $(F', N')$ as follows:

$F' = F$, $N' = \{k_0 + 1\}$, when $j_0 = j$;

$F' = \{j + 1, j + 2, \cdots, j + m_0\}$, $N' = \{j, j + m_0 + 1\}$, when $j_0 > j$, $k_0 < k$;

$F' = \{j + 1, j + 2, \cdots, j + m_0\}$, $N' = \{j\}$, when $j_0 > j$, $k_0 = k$.

Now, we show that $(F', N')$ satisfies the requirement in (d). First, by the construction, $(F', N') \in \mathcal{C}_j$, $(F' \cup N') \in \mathcal{R}_p$ and $F' \in \mathcal{R}_p$. Second, $|F'| = |F|$, $|N'| \le 2 < |N|$. Third, we check $\psi(F', N') \le \psi(F, N)$. Applying (A.160) and (A.161), direct calculations yield $\omega(F', N') = \omega(F, N)$. This, together with $|F'| \le |F|$ and $|N'| \le |N|$, proves $\psi(F', N') \le \psi(F, N)$. □

A.3. Proof of Lemma 5.8. Recalling the definition of $\mathcal{C}_j^*$ in the proof of Lemma 5.7, the claim reduces to

$\min_{(F,N) \in \mathcal{C}_j^*} \psi(F, N) = \min_{(F,N) \in \mathcal{C}_1^*} \psi^{(\infty)}(F, N) + o(1), \qquad \log(p) \le j \le p - \log(p)$.

We argue that on both sides, the minimum is not attained on $(F, N)$ such that $|N| = 0$ and $|F| = 1$. In this case, on the left hand side, $F = \{j\}$ and $N = \emptyset$. By direct calculations, $\omega(F, N) = j \ge \log(p)$, and hence $\psi(F, N)$ cannot be the minimum. Similarly, on the right hand side, $\omega^{(\infty)}(F, N) = \infty$ by definition, and the same conclusion follows. Therefore, the claim is equivalent to

$\min_{(F,N) \in \mathcal{C}_j^*: \ |F| + |N| > 1} \psi(F, N) = \min_{(F,N) \in \mathcal{C}_1^*: \ |F| + |N| > 1} \psi^{(\infty)}(F, N) + o(1)$.

Fix $\log(p) \le j \le p - \log(p)$. Define a one-to-one mapping from $\mathcal{C}_j^*$ to $\mathcal{C}_1^*$, where any $(F, N) \in \mathcal{C}_j^*$ is mapped to $(F_1, N_1)$ such that

$F_1 = \{i - j + 1 : i \in F\}, \qquad N_1 = \{i - j + 1 : i \in N\}$.

To show the claim, it suffices to show that when $|F| + |N| > 1$,

$\psi(F, N) = \psi^{(\infty)}(F_1, N_1) + o(1)$.

Since $|F_1| = |F|$ and $|N_1| = |N|$, it is sufficient to show

(A.164)  $\omega(F, N) = \omega^{(\infty)}(F_1, N_1) + o(1)$.

Now, we show (A.164). Consider the case $N \ne \emptyset$ first. Suppose $|\mathcal{I}| = k$ and write $\mathcal{I} = F \cup N = \{j, \cdots, j + k - 1\}$, where $1 < k \le 5$. Let $R = (G^{\mathcal{I},\mathcal{I}})^{-1}$ and $R_* = (\Sigma_*^{(k)})^{F_1,F_1}$, where $\Sigma_*$ is defined in (5.110). We note that when $N \ne \emptyset$, $R_*$ is invertible. Using (A.160) and the definition of $\omega^{(\infty)}$,

(A.165)  $|\omega(F, N) - \omega^{(\infty)}(F_1, N_1)| \le \max_{\xi: \ |\xi_i| \le 2a} |\xi'[(R^{F,F})^{-1} - R_*^{-1}]\xi|$.

Since $\mathcal{I} \in \mathcal{R}_p$, we apply (A.161) and obtain

$R^{F,F} = \frac{1}{j}(\eta^{F_1})(\eta^{F_1})' + (\Sigma_*^{(k)})^{F_1,F_1}$,

where $\eta = (1, 0, \cdots, 0)' \in \mathbb{R}^k$. By the matrix inverse formula,

(A.166)  $\xi'[(R^{F,F})^{-1} - R_*^{-1}]\xi = -[\, j + (\eta^{F_1})'R_*^{-1}\eta^{F_1} \,]^{-1}(\xi'R_*^{-1}\eta^{F_1})^2$.

Combining (A.165) and (A.166),

$|\omega(F, N) - \omega^{(\infty)}(F_1, N_1)| \le j^{-1}\max_{\xi: \ |\xi_i| \le 2a} |\xi'R_*^{-1}\eta^{F_1}|^2 \le j^{-1} \cdot C\|R_*^{-1}\|^2$.

Since $N_1 \ne \emptyset$ and $k$ is finite, $\lambda_{\min}(R_*) \ge C > 0$ and hence $\|R_*^{-1}\| \le C$. Noting that $j \ge \log(p)$, (A.164) follows directly.

Next, consider the case $N = \emptyset$. Suppose $|F| = k$ and write $F = \{j, \cdots, j + k - 1\}$, where $1 < k \le 3$. We observe that $G^{F,F} = j11' + \Omega_*^{(k)}$, where $\Omega_*$ is defined in (5.111). By definition

$\omega(F, N) = \min_{\xi \in \mathbb{R}^k: \ |\xi_i| \ge 1} \xi'G^{F,F}\xi = \min_{\xi \in \mathbb{R}^k: \ |\xi_i| \ge 1} [\, j(1'\xi)^2 + \xi'\Omega_*^{(k)}\xi \,]$.

On one hand, if we let $\xi^*$ be one minimizer in the definition of $\omega^{(\infty)}(F_1, N_1)$, then $1'\xi^* = 0$. As a result,

(A.167)  $\omega(F, N) \le j(1'\xi^*)^2 + (\xi^*)'\Omega_*^{(k)}\xi^* = (\xi^*)'\Omega_*^{(k)}\xi^* = \omega^{(\infty)}(F_1, N_1)$.

On the other hand, we can show

(A.168)  $\omega(F, N) \ge \omega^{(\infty)}(F_1, N_1) - 1/(j + 1)$.

Combining (A.167) and (A.168), and noting that $j \ge \log(p)$, we obtain (A.164).

It remains to show (A.168). When $k = 2$, by direct calculations, $\omega(F, N) = \omega^{(\infty)}(F_1, N_1) = 1$. When $k > 2$, write $\xi = (\xi_1, \xi_2, \tilde\xi')'$ for any $\xi \in \mathbb{R}^k$, and introduce the function $g(x) = \sum_{i=1}^{k-2}(x_i + x_{i+1} + \cdots + x_{k-2})^2$, for $x \in \mathbb{R}^{k-2}$. We observe that

(A.169)  $\xi'\Omega_*^{(k)}\xi = (1'\xi - \xi_1)^2 + g(\tilde\xi)$.

Let $g_{\min} = \min_{x \in \mathbb{R}^{k-2}: \ |x_i| \ge 1} g(x)$. We claim that there exists $q \in \mathbb{R}^k$ such that

$1'q = 0$, $\quad q'\Omega_*^{(k)}q = 1 + g_{\min}$, $\quad$ and $|q_i| \ge 1$, for $1 \le i \le k$.

To see this, note that under the constraints $|x_i| \ge 1$, $g(x)$ is obviously minimized at $x^* = (\cdots, -1, 1, -1, 1)$. Observing that $1'(x^*)$ is either $0$ or $1$, we let $q = (1, -1, (x^*)')'$ when $1'(x^*) = 0$, and let $q = (1, -2, (x^*)')'$ when $1'(x^*) = 1$. Using (A.169), it is easy to check that $q$ satisfies the above requirements. It follows that

(A.170)  $\omega^{(\infty)}(F_1, N_1) = \min_{\xi \in \mathbb{R}^k: \ |\xi_i| \ge 1, \ 1'\xi = 0} \xi'\Omega_*^{(k)}\xi \le q'\Omega_*^{(k)}q = 1 + g_{\min}$.

At the same time, since $G^{F,F} = j11' + \Omega_*^{(k)}$, we can write from (A.169) that

(A.171)  $\xi'G^{F,F}\xi = j(1'\xi)^2 + (1'\xi - \xi_1)^2 + g(\tilde\xi)$.

Note that $\min_y\{jy^2 + (y - c)^2\} = c^2j/(j + 1)$, for any $c \in \mathbb{R}$. So

$j(1'\xi)^2 + (1'\xi - \xi_1)^2 \ge |\xi_1|^2 j/(j + 1)$.

Plugging this into (A.171), we find that

(A.172)  $\omega(F, N) = \min_{\xi \in \mathbb{R}^k: \ |\xi_i| \ge 1} \xi'G^{F,F}\xi \ge j/(j + 1) + g_{\min}$.

Combining (A.170) and (A.172) gives (A.168). □
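For completeness, the scalar minimization $\min_y\{jy^2 + (y - c)^2\} = c^2j/(j + 1)$ used in the proof of (A.172) can be worked out as follows. The symbol $\varphi$ is introduced here only for this derivation.

```latex
\begin{align*}
\varphi(y) &= j y^2 + (y - c)^2, \\
\varphi'(y) &= 2 j y + 2 (y - c) = 0
  \iff y^* = \frac{c}{j + 1}, \\
\varphi(y^*) &= \frac{j c^2}{(j + 1)^2} + \frac{j^2 c^2}{(j + 1)^2}
  = \frac{c^2 j}{j + 1}.
\end{align*}
```

Since $\varphi$ is a convex quadratic in $y$, the stationary point $y^*$ is its global minimizer.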

A.4. Proof of Lemma 5.11. To show the claim, we first introduce a key lemma: Fix a linear filter $D_{h,\eta}$; for any dimension $k > h$, let $D^{(k)}$ be the $(k - h) \times k$ matrix, where for each $1 \le i \le k - h$, $D^{(k)}(i,i) = 1$, $D^{(k)}(i,i+1) = \eta_1$, $\cdots$, $D^{(k)}(i,i+h) = \eta_h$ and $D^{(k)}(i,j) = 0$ for other $j$. Define the null space of $D_{h,\eta}$ in dimension $k$, $Null_k(\eta)$, as the collection of all vectors $\xi \in \mathbb{R}^k$ that satisfy $D^{(k)}\xi = 0$. The following lemma is proved below.

Lemma A.1. For a given $\eta$, if RCA holds, then for sufficiently large $n$ and any $k \ge n$, there exists an orthonormal basis of $Null_k(\eta)$, denoted as $\xi^{(1)}, \cdots, \xi^{(h)}$, such that

$\max_{1 \le i \le k - n, \ 1 \le j \le h} |\xi_i^{(j)}|^2 \le C_\eta n^{-1}$,

where $C_\eta > 0$ is a constant that only depends on $\eta$.
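As a concrete illustration of the matrix $D^{(k)}$ and its null space, the following Python sketch builds $D^{(k)}$ for a hypothetical first-order differencing filter ($h = 1$, $\eta_1 = -1$). The filter choice and the helper names `filter_matrix` and `apply_matrix` are assumptions made for this example and do not come from the paper; whether this particular filter meets the RCA condition is not addressed here. For this filter the null space is spanned by the constant vector, whose normalized entries all have squared size $1/k$.

```python
def filter_matrix(eta, k):
    """Build the (k - h) x k matrix D^(k) for coefficients eta = (eta_1, ..., eta_h)."""
    h = len(eta)
    coeffs = [1.0] + list(eta)
    D = [[0.0] * k for _ in range(k - h)]
    for i in range(k - h):
        for offset, c in enumerate(coeffs):
            D[i][i + offset] = c
    return D


def apply_matrix(D, xi):
    """Return the matrix-vector product D xi."""
    return [sum(d * x for d, x in zip(row, xi)) for row in D]


# Hypothetical example: first-order differencing, h = 1 and eta_1 = -1.
k = 50
D = filter_matrix([-1.0], k)

# The null space is spanned by the constant vector; normalise it to unit length.
xi = [k ** -0.5] * k
residual = max(abs(v) for v in apply_matrix(D, xi))
largest_squared_entry = max(x * x for x in xi)
```

Here every squared entry of the basis vector equals $1/k$ up to rounding, so it is at most $1/n$ for any $n \le k$, which has the same form as the bound in Lemma A.1.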

Second, we state some observations. Fix $\mathcal{I} \lhd \mathcal{G}^+$. Partition $\mathcal{I}^{pe}$ uniquely as $\mathcal{I}^{pe} = \cup_{t=1}^{T} V_t$, so that $V_t = \{i_t, i_t + 1, \cdots, j_t - 1, j_t\}$ is formed by consecutive nodes and $j_t < i_{t+1}$ for all $t$. Denote $M = |\mathcal{J}^{pe}| - |\mathcal{I}^{pe}|$. It is easy to see that $T \le M$ and $M \le h|\mathcal{I}| \le l_0h$, so both $M$ and $T$ are finite. Let $\tilde V_t = \{1 \le j \le p : D(i,j) \ne 0 \text{ for some } i \in V_t\}$ and define $Null(V_t, \tilde V_t)$ in the same way as $Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$. Recall that $\mathcal{F}$ is the mapping from nodes in $\mathcal{J}^{pe}$ to their orders in $\mathcal{J}^{pe}$. Similarly, define the mapping $\mathcal{F}_t$ from $\tilde V_t$ to $\{1, \cdots, |\tilde V_t|\}$ that maps each $j \in \tilde V_t$ to its order in $\tilde V_t$. Denote $\mathcal{I}_t = \mathcal{F}_t(\mathcal{I} \cap V_t)$. We observe that:

(O1) $\tilde V_t \cap \tilde V_{t'} \ne \emptyset$ only when $|t - t'| \le 1$; and $|\tilde V_t \cap \tilde V_{t+1}| \le h - 1$, for all $t$.

(O2) $Null(V_t, \tilde V_t) = Null_{|\tilde V_t|}(\eta)$ for all $t$, where $Null_k(\eta)$ is as in Lemma A.1.

(O3) $\mathcal{J}^{pe} = \cup_{t=1}^{T} \tilde V_t$; and $|\tilde V_t| \ge |V_t| \ge 2\ell^{pe} + 1$, for all $t$.

(O4) Any node $i \in \mathcal{I}_t$ satisfies $1 \le i \le |\tilde V_t| - \ell^{pe}$, for all $t$.

(O5) For any $\xi \in \mathbb{R}^{|\mathcal{J}^{pe}|}$, $\xi \in Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$ if and only if $\xi^{\mathcal{F}(\tilde V_t)} \in Null(V_t, \tilde V_t)$ for all $t$, where $\xi^{\mathcal{F}(\tilde V_t)}$ is the subvector of $\xi$ formed by elements in $\mathcal{F}(\tilde V_t)$.

Due to (O2) and Lemma A.1, for each $t$, there exists an orthonormal basis $\xi^{(t,1)}, \cdots, \xi^{(t,h)}$ for $Null(V_t, \tilde V_t)$ such that

(A.173)  $\max_{1 \le i \le |\tilde V_t| - n, \ 1 \le j \le h} |\xi_i^{(t,j)}|^2 \le C_\eta n^{-1}$, for any $1 \le n \le |\tilde V_t|$.

Let $U_t$ be the matrix formed by the last $h$ rows of $[\xi^{(t,1)}, \cdots, \xi^{(t,h)}]$. From the explicit form of the basis in the proof of Lemma A.1, we further observe:

(O6) $c_1 \le \lambda_{\min}(U_tU_t') \le \lambda_{\max}(U_tU_t') \le 1 - c_2$, where $0 < c_1, c_2 < 1$ and $c_1 + c_2 < 1$.

(O7) For each $1 \le h_0 \le h$, the submatrix of $U_t$ formed by its last $h_0$ rows has rank $h_0$.

Now, we show the claim by constructing a matrix $W$, whose columns form an orthonormal basis for $Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$, and which satisfies

(A.174)  $\max_{1 \le i \le |\mathcal{J}^{pe}| - n, \ 1 \le j \le M} |W(i,j)|^2 \le Cn^{-1}$, for any $1 \le n \le |\mathcal{J}^{pe}|$.

In fact, once such $W$ is constructed, any $U$ whose columns form an orthonormal basis for $Null(\mathcal{I}^{pe}, \mathcal{J}^{pe})$ can be written as

$U = WR$,

where $R$ has the dimension $M \times M$ and $R'R$ is an identity matrix. By basic algebra, for any $m \times n$ matrix $A$ and $n \times p$ matrix $B$, $\max_{1 \le j \le p} |(AB)(i,j)|^2 \le n\|B'B\| \cdot \max_{1 \le k \le n} |A(i,k)|^2$ for each $1 \le i \le m$. Applying this to $W$ and $R$, and noting that $\|R'R\| = 1$ and that $M$ is finite, we obtain

(A.175)  $\max_{i \in \mathcal{F}(\mathcal{I}), \ 1 \le j \le M} |U(i,j)|^2 \le C\max_{i \in \mathcal{F}(\mathcal{I}), \ 1 \le j \le M} |W(i,j)|^2$.

At the same time, for any $i \in \mathcal{I}$, there exists a unique $t$ such that $i \in \mathcal{I} \cap V_t$. In addition, from (O4), $\mathcal{F}_t(i) \le |\tilde V_t| - \ell^{pe}$. By the construction, this implies $\mathcal{F}(i) \le |\mathcal{J}^{pe}| - \ell^{pe}$. Combining this with (A.174), we find that

(A.176)  $\max_{i \in \mathcal{F}(\mathcal{I}), \ 1 \le j \le M} |W(i,j)|^2 \le C(\ell^{pe})^{-1}$.

The claim then follows from (A.175) and (A.176).
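The entrywise bound behind (A.175) can be checked on a small example. In the sketch below, the matrix `W`, the rotation angle `theta`, and the helper `matmul` are hypothetical choices for illustration only. Taking $R$ to be a $2 \times 2$ rotation gives $R'R = I$, so the bound reads $\max_j |U(i,j)|^2 \le M \max_k |W(i,k)|^2$ for each row $i$ of $U = WR$.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical W with M = 2 columns, and an M x M rotation R, so that R'R = I.
W = [[0.9, 0.1], [0.05, -0.02], [0.3, 0.4]]
theta = 0.7
R = [
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta), math.cos(theta)],
]
U = matmul(W, R)
M = len(R)

# Row-wise check of max_j |U(i,j)|^2 <= M * max_k |W(i,k)|^2 (using ||R'R|| = 1);
# the small tolerance only guards against floating-point rounding.
row_checks = [
    max(u * u for u in u_row) <= M * max(w * w for w in w_row) + 1e-12
    for u_row, w_row in zip(U, W)
]
```

The inequality is a direct consequence of the Cauchy-Schwarz inequality applied to each entry of $WR$, so every entry of `row_checks` is expected to hold for any choice of `W` and `theta`.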

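To construct $W$, the key is to recursively construct matrices $W_T, W_{T-1}, \ldots, W_1$.
Denote $m_t = h - |V_t \cap V_{t+1}|$, with $m_T = h$ by convention; $M_t = \sum_{s=t}^{T} m_s$
and $L_t = |\cup_{s=t}^{T} V_s|$; in particular, $M_1 = |\mathcal{J}^{pe}| - |\mathcal{I}^{pe}| = M$ and $L_1 = |\mathcal{J}^{pe}|$.
Initially, construct the $L_T \times M_T$ matrix

$W_T = [\xi^{(T,1)}, \ldots, \xi^{(T,h)}]$,

where $\{\xi^{(T,j)} : 1 \le j \le h\}$ is the orthonormal basis in (A.173). Given $W_{t+1}$,
construct the $L_t \times M_t$ matrix $W_t$ as follows: Denote by $\tilde{W}_{t+1}$ the submatrix of
$W_{t+1}$ formed by its first $|V_t \cap V_{t+1}|$ $(= h - m_t)$ rows and write

$[\xi^{(t,1)}, \ldots, \xi^{(t,h)}] = \begin{bmatrix} A_t \\ B_t \end{bmatrix}$,

where $A_t$ has $(|V_t| - h + m_t)$ rows and $B_t$ has $(h - m_t)$ rows. From (O7), the
rank of $B_t$ is $(h - m_t)$. Hence, there exists an $h \times m_t$ matrix $Q_t$, such that
$Q_t'Q_t$ is an identity matrix and $B_t Q_t = 0$. Now, construct

(A.177)  $W_t = \begin{bmatrix} A_t B_t'(B_t B_t')^{-1} \tilde{W}_{t+1} & A_t Q_t \\ W_{t+1} & 0 \end{bmatrix}$.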
Continue this process until we obtain $W_1$ and let

$W = W_1 (W_1' W_1)^{-1/2}$.

Below, we check that $W$ satisfies the requirement. First, we show that the
columns of $W$ form an orthonormal basis of $\mathrm{Null}(\mathcal{I}^{pe}, \mathcal{J}^{pe})$. Since $W$ has
$M = |\mathcal{J}^{pe}| - |\mathcal{I}^{pe}|$ columns and its columns are orthonormal, it suffices to
show that all its columns belong to $\mathrm{Null}(\mathcal{I}^{pe}, \mathcal{J}^{pe})$. By (O5), we only need
to show that for each $1 \le t \le T$, in the submatrix of $W$ formed by restricting
rows into $\mathcal{F}(V_t)$, all its columns belong to $\mathrm{Null}(V_t, V_t)$. By the construction,
only the first $M_t$ columns of this submatrix are non-zero and they are equal
to

$\begin{bmatrix} A_t B_t'(B_t B_t')^{-1} \tilde{W}_{t+1} & A_t Q_t \\ \tilde{W}_{t+1} & 0 \end{bmatrix} = \begin{bmatrix} A_t \\ B_t \end{bmatrix} \begin{bmatrix} B_t'(B_t B_t')^{-1} \tilde{W}_{t+1} & Q_t \end{bmatrix}$,

where in the equality we have used the facts that $\tilde{W}_{t+1} = B_t B_t'(B_t B_t')^{-1} \tilde{W}_{t+1}$
and $B_t Q_t = 0$. Combining this with the definition of $A_t$ and $B_t$, we find that
each column of the above matrix is a linear combination of $\{\xi^{(t,1)}, \ldots, \xi^{(t,h)}\}$
and hence belongs to $\mathrm{Null}(V_t, V_t)$.

Second, we show that $W$ satisfies (A.174). It suffices to show, for $t = T, \ldots, 1$,

(a) $\max_{1 \le i \le L_t - n,\ 1 \le j \le M_t} |W_t(i,j)|^2 \le C n^{-1}$, for any $1 \le n \le L_t$.

(b) $\lambda_{\min}(W_t' W_t) \ge C > 0$.

In fact, once (a) and (b) are proved, by taking $t = 1$ and noticing that
$L_1 = |\mathcal{J}^{pe}|$, we have $\max_{1 \le i \le |\mathcal{J}^{pe}| - n,\ 1 \le j \le M} |W_1(i,j)|^2 \le C n^{-1}$, for
$1 \le n \le |\mathcal{J}^{pe}|$; and $\|(W_1' W_1)^{-1}\| = [\lambda_{\min}(W_1' W_1)]^{-1} \le C$. Hence, by similar
arguments in (A.175), for each $1 \le i \le |\mathcal{J}^{pe}| - n$,
$\max_{1 \le j \le M} |W(i,j)|^2 \le M \|(W_1' W_1)^{-1}\| \cdot \max_{1 \le j \le M} |W_1(i,j)|^2 \le C n^{-1}$. This gives (A.174).

It remains to show (a) and (b). Note that for $W_T$, by the construction
and (A.173), (a) and (b) hold trivially. We aim to show that if (a) and (b)
hold for $W_{t+1}$, then they also hold for $W_t$. For preparation, we argue that

(A.178)  $\|A_t B_t'(B_t B_t')^{-1} \tilde{W}_{t+1}\|^2 \le C (\ell^{pe})^{-1} = o(1)$.

To see this, note that $L_{t+1} \ge 2\ell^{pe} + 1$ from (O3); in particular, $h - m_t \ll L_{t+1} - \ell^{pe}$.
Hence, if (a) holds for $W_{t+1}$, $\max_{1 \le i \le L_{t+1} - \ell^{pe},\ 1 \le j \le M_{t+1}} |W_{t+1}(i,j)|^2 \le C (\ell^{pe})^{-1}$,
i.e., $|\tilde{W}_{t+1}(i,j)|^2 \le C (\ell^{pe})^{-1}$, for any $(i,j)$. Since $\tilde{W}_{t+1}$ has a finite
dimension, this yields $\|\tilde{W}_{t+1}\|^2 \le C (\ell^{pe})^{-1}$. Furthermore, from (O6) and
that $B_t B_t'$ is a submatrix of $U_t U_t'$, $\lambda_{\min}(B_t B_t') \ge d > 0$. So $\|(B_t B_t')^{-1}\| \le C$.
In addition, $\|A_t\|, \|B_t\| \le 1$. Combining the above gives (A.178).

Consider (a) first. By (A.177), (A.178) and the assumption on $W_{t+1}$, it
suffices to show

(A.179)  $\max_{1 \le i \le |V_t| - n,\ 1 \le j \le m_t} |(A_t Q_t)(i,j)|^2 \le C n^{-1}$,  for any $1 \le n \le |V_t|$.

By similar arguments in (A.175) and the fact that $\|Q_t' Q_t\| = 1$, the left hand
side is bounded by $C \max_{1 \le i \le |V_t| - n,\ 1 \le j \le h} |A_t(i,j)|^2$. Therefore, (A.179)
follows from (A.173) and the definition of $A_t$.

Next, consider (b). Using (A.177) and (A.178), we can write

$W_t' W_t = \begin{bmatrix} W_{t+1}' W_{t+1} + \Delta_1 & \Delta_2 \\ \Delta_2' & Q_t' A_t' A_t Q_t \end{bmatrix}$,

where $\|\Delta_1\| = o(1)$ and $\|\Delta_2\| = o(1)$. So it suffices to show $\lambda_{\min}(W_{t+1}' W_{t+1}) \ge C$
and $\lambda_{\min}(Q_t' A_t' A_t Q_t) \ge C$. The former follows from the assumption on
$W_{t+1}$. To show the latter, note that $Q_t' Q_t$ is an identity matrix, and so
$\lambda_{\min}(Q_t' A_t' A_t Q_t) \ge \lambda_{\min}(A_t' A_t)$. Also, since $A_t' A_t + B_t' B_t$ is an identity
matrix, $\lambda_{\min}(A_t' A_t) = 1 - \lambda_{\max}(B_t' B_t)$. Additionally, $\lambda_{\max}(B_t' B_t) = \lambda_{\max}(B_t B_t')$,
where $B_t B_t'$ is a submatrix of $U_t U_t'$, and by (O6), $\lambda_{\max}(U_t U_t') \le 1 - c$.
Combining the above yields $\lambda_{\min}(Q_t' A_t' A_t Q_t) \ge c > 0$. This proves (b). □

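A.4.1. Proof of Lemma A.1. For each $k \ge h$, we construct a $k \times h$ matrix
$U$ whose columns form an orthonormal basis of $\mathrm{Null}_k(\eta)$ as follows: Recall
the characteristic polynomial $\varphi_\eta(z) = 1 + \eta_1 z + \cdots + \eta_h z^h$. Let $z_1, \ldots, z_m$
be $m$ different roots of $\varphi_\eta(z)$, each replicating $h_1, \ldots, h_m$ times respectively
($h_1 + \cdots + h_m = h$). For $1 \le j \le m$ and $1 \le s \le h_j$, when $z_j$ is a real root,
let

$\mu^{(j,s)} = \left( \frac{k^{s-1}}{z_j^{k-1}}, \ldots, \frac{3^{s-1}}{z_j^{2}}, \frac{2^{s-1}}{z_j}, 1 \right)'$,

and when $z_{j\pm} = |z_j| e^{\pm\sqrt{-1}\,\theta_j}$, $\theta_j \in (0, \pi/2]$, are a pair of conjugate roots, let

$\mu^{(j+,s)} = \left( k^{s-1} \frac{\cos((k-1)\theta_j)}{|z_j|^{k-1}}, \ldots, 3^{s-1} \frac{\cos 2\theta_j}{|z_j|^{2}}, 2^{s-1} \frac{\cos \theta_j}{|z_j|}, 1 \right)'$,

$\mu^{(j-,s)} = \left( k^{s-1} \frac{\sin((k-1)\theta_j)}{|z_j|^{k-1}}, \ldots, 3^{s-1} \frac{\sin 2\theta_j}{|z_j|^{2}}, 2^{s-1} \frac{\sin \theta_j}{|z_j|}, 0 \right)'$.

It is seen that $\{\mu^{(j,s)} : 1 \le j \le m,\ 1 \le s \le h_j\}$ are $h$ vectors in $\mathbb{R}^k$. Let
$\xi^{(j,s)} = \mu^{(j,s)} / \|\mu^{(j,s)}\|$ for each $(j,s)$, and construct the $k \times h$ matrix

$R = [\xi^{(1,1)}, \ldots, \xi^{(1,h_1)}, \ldots, \xi^{(m,1)}, \ldots, \xi^{(m,h_m)}]$.

Define

$U = R (R'R)^{-1/2}$.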

Now, we show that the vectors $\{\mu^{(j,s)} : 1 \le j \le m,\ 1 \le s \le h_j\}$ are linearly
independent and span $\mathrm{Null}_k(\eta)$. Therefore, $U$ is well defined and its columns
form an orthonormal basis of $\mathrm{Null}_k(\eta)$. To see this, note that for any vector
$\xi \in \mathbb{R}^k$, if we write $\xi_1 = f(k), \ldots, \xi_k = f(1)$, then $\xi \in \mathrm{Null}_k(\eta)$ if and only
if the $f(i)$'s satisfy the difference equation:

(A.180)  $f(i) + \eta_1 f(i-1) + \cdots + \eta_h f(i-h) = 0$,  $h + 1 \le i \le k$.

It is well-known in the theory of difference equations that (A.180) has $h$
independent base solutions:

$f_{j,s}(i) = i^{s-1} z_j^{-(i-1)}$,  $1 \le j \le m$, $1 \le s \le h_j$.

By the construction, when $z_j$ is a real root, $\mu^{(j,s)} = (f_{j,s}(k), \ldots, f_{j,s}(1))'$; and
when $z_{j\pm}$ are a pair of conjugate roots, $\mu^{(j+,s)}$ and $\mu^{(j-,s)}$ are the real and
imaginary parts of the vector $(f_{j,s}(k), \ldots, f_{j,s}(1))'$. So the vectors $\{\mu^{(j,s)}\}$
are linearly independent and they span $\mathrm{Null}_k(\eta)$.
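
To make the link between the base solutions and (A.180) concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions rather than part of the proof: the two filters below (one with roots 2 and 3, one with a double root at 1) are hypothetical examples, only real roots are handled, and the function names are chosen here for convenience.

```python
def base_solution(z, s, i):
    """f_{j,s}(i) = i^(s-1) * z^(-(i-1)) for a real root z."""
    return float(i) ** (s - 1) * float(z) ** (-(i - 1))


def mu_vector(z, s, k):
    """mu^{(j,s)} = (f(k), ..., f(1))' for a real root z."""
    return [base_solution(z, s, i) for i in range(k, 0, -1)]


def max_residual(xi, eta):
    """Largest |f(i) + eta_1 f(i-1) + ... + eta_h f(i-h)| over h+1 <= i <= k,
    where xi_1 = f(k), ..., xi_k = f(1)."""
    k, h = len(xi), len(eta)
    f = {i: xi[k - i] for i in range(1, k + 1)}  # xi[k - i] stores f(i)
    return max(abs(f[i] + sum(eta[l - 1] * f[i - l] for l in range(1, h + 1)))
               for i in range(h + 1, k + 1))


k = 12
# Hypothetical filter 1: 1 - (5/6) z + (1/6) z^2 = (1 - z/2)(1 - z/3), roots 2 and 3.
eta_distinct = [-5.0 / 6.0, 1.0 / 6.0]
# Hypothetical filter 2: 1 - 2 z + z^2 = (1 - z)^2, double root at 1.
eta_double = [-2.0, 1.0]

res_distinct = [max_residual(mu_vector(z, 1, k), eta_distinct) for z in (2.0, 3.0)]
res_double = [max_residual(mu_vector(1.0, s, k), eta_double) for s in (1, 2)]
# f(i) = i^2 is not a base solution for the double root, so its residual is non-zero.
res_wrong = max_residual([float(i * i) for i in range(k, 0, -1)], eta_double)
print(res_distinct, res_double, res_wrong)
```

The residuals of the base solutions vanish up to rounding, while the deliberately wrong sequence $f(i) = i^2$ leaves a constant residual.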

Next, we check that the columns of $U$ satisfy the requirement in the claim,
i.e., there exists a constant $C_\eta$ such that for any $(n, k)$ satisfying $k \ge n \ge h$,

$\max_{1 \le i \le k - n,\ 1 \le j \le h} |U(i,j)|^2 \le C_\eta n^{-1}$.

Since $\max_{1 \le j \le h} |U(i,j)|^2 \le h \|(R'R)^{-1}\| \cdot \max_{1 \le j \le h} |R(i,j)|^2$, it suffices to
show that

(A.181)  $\max_{1 \le i \le k - n,\ 1 \le j \le h} |R(i,j)|^2 \le C n^{-1}$,

and that for all $k \ge h$,

(A.182)  $\lambda_{\min}(R'R) \ge C > 0$.

Consider (A.181) first. It is equivalent to show that

(A.183)  $\max_{1 \le i \le k - n} |\mu^{(j,s)}_i| / \|\mu^{(j,s)}\| \le C n^{-1/2}$,  $1 \le j \le m$, $1 \le s \le h_j$.

In the case $|z_j| > 1$, $\|\mu^{(j,s)}\|^{-1} \le C$. In addition, $|z_j|^{i} \ge C i^{s - 1/2}$ for sufficiently
large $i$, and hence $\max_{1 \le i \le k - n} |\mu^{(j,s)}_i| \le \max_{i > n} C (i^{s-1} i^{1/2 - s}) \le C n^{-1/2}$.
So (A.183) holds. In the case $|z_j| = 1$, it can be shown in analysis that
$\|\mu^{(j,s)}\| \ge C k^{s - 1/2}$, where $C > 0$ is a constant depending on $\theta_j$ but
independent of $k$. Also, $\max_{1 \le i \le k - n} |\mu^{(j,s)}_i| \le \max_{n < i \le k} C i^{s-1} \le C k^{s-1}$. Hence,
$\max_{1 \le i \le k - n} |\mu^{(j,s)}_i| / \|\mu^{(j,s)}\| \le C k^{-1/2} \le C n^{-1/2}$ and (A.183) holds.
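
The decay in (A.181) and (A.183) can be illustrated on a toy case. The sketch below assumes a hypothetical filter with a double root at $z = 1$ (so $h = 2$ and the unnormalized entries are $i^{s-1}$); the helper names and the constant 3 are specific to this example and are not taken from the paper. For $s = 2$ the scaled quantity equals $6kn / ((k+1)(2k+1))$, which stays below 3.

```python
import math


def normalized_columns(k):
    """Columns xi^{(1,1)}, xi^{(1,2)} for a double root z = 1.

    The unnormalized entries are i^(s-1), listed for i = k, ..., 1."""
    cols = []
    for s in (1, 2):
        mu = [float(i ** (s - 1)) for i in range(k, 0, -1)]
        norm = math.sqrt(sum(x * x for x in mu))
        cols.append([x / norm for x in mu])
    return cols


def scaled_max(k, n):
    """n * max over the first k - n rows and both columns of |R(i,j)|^2 (needs n < k)."""
    cols = normalized_columns(k)
    return n * max(col[i] ** 2 for col in cols for i in range(k - n))


for n in (2, 10, 100, 1000):
    print(n, scaled_max(2000, n))
```

A bounded value of `scaled_max` across `n` is exactly what a bound of the form $C n^{-1}$ requires in this toy setting.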
Next, consider (A.182). $R'R$ is an $h \times h$ matrix. For convenience, we use
$\{(j, s) : 1 \le j \le m,\ 1 \le s \le h_j\}$ to index the entries in $R'R$. By construction,
all the diagonals of $R'R$ are equal to 1, and the off-diagonals are equal to

(A.184)  $(R'R)_{(j,s),(j',s')} = \sum_{i=1}^{k} \xi^{(j,s)}_i \xi^{(j',s')}_i$.

It is easy to see that as $k \to \infty$, each entry of $R'R$ has a finite limit.
Therefore, as $k \to \infty$, $R'R$ approaches a fixed $h \times h$ matrix $A$ element-wise.
In particular, $\lambda_{\min}(R'R) \to \lambda_{\min}(A)$. Hence, to show (A.182), we only need
to prove that $A$ is non-singular.

Write $R = (R_1, R_2)$, where $R_1$ is the submatrix formed by columns
corresponding to those roots $|z_j| > 1$, and $R_2$ the submatrix formed by
columns corresponding to those roots $|z_j| = 1$. Note that when $|z_j| = 1$ and
$|z_{j'}| > 1$, as $k \to \infty$, $|(\mu^{(j,s)}, \mu^{(j',s')})| \le C$, $\|\mu^{(j,s)}\| \to \infty$ and $\|\mu^{(j',s')}\| \ge C$;
so $(R'R)_{(j,s),(j',s')} \to 0$. This means $R_1' R_2$ approaches the zero matrix as
$k \to \infty$. Consequently,

$A = \mathrm{diag}(A_1, A_2)$, where $R_1' R_1 \to A_1$ and $R_2' R_2 \to A_2$, as $k \to \infty$.

Therefore, it suffices to show that both $A_1$ and $A_2$ are non-singular.

Consider $A_1$ first. Denote $h_0 = |\{(j,s) : |z_j| > 1,\ 1 \le s \le h_j\}|$ so that $R_1$ is a $k \times h_0$
matrix. Let $R_1^*$ be the $k \times h_0$ matrix whose columns are $\{\mu^{(j,s)} : |z_j| > 1\}$,
$M$ be the $h_0 \times h_0$ submatrix formed by the last $h_0$ rows of $R_1^*$, and
$\Lambda = \mathrm{diag}(\|\mu^{(j,s)}\|^{-1})$ be the $h_0 \times h_0$ diagonal matrix. Now, suppose $A_1$ is singular, i.e.,
there exists a non-zero vector $b$ such that $b' A_1 b = 0$. This implies $\|R_1 b\| \to 0$
as $k \to \infty$. Using the matrices defined above, we can write $R_1 = R_1^* \Lambda$; so
$\|R_1^* \Lambda b\| \to 0$. Since $\|M \Lambda b\| \le \|R_1^* \Lambda b\|$, it further implies $\|M \Lambda b\| \to 0$. First,
we observe that $M$ is a fixed matrix independent of $k$. Second, note that
when $|z_j| > 1$, $\|\mu^{(j,s)}\| \to C_{js}$, as $k \to \infty$, for some constant $C_{js} > 0$; as a
result, $\Lambda \to \Lambda^*$ as $k \to \infty$, where $\Lambda^*$ is a positive definite diagonal matrix.
Combining the two parts, $\|M \Lambda b\| \to 0$ implies $\|M (\Lambda^* b)\| = 0$, where $\Lambda^* b$
is a fixed non-zero vector. This means $M$ is singular. Therefore, if we can
prove $M$ is non-singular, then by contradiction, $A_1$ is also non-singular.

Now, we show $M$ is non-singular. Let $\tilde{M}$ be the matrix obtained by re-arranging
the rows in $M$ in the inverse order. It is easy to see that $\tilde{M}$ is non-singular if
and only if $M$ is non-singular. For convenience, we use $\{1, \ldots, h_0\} \times \{(j, s) :
|z_j| > 1,\ 1 \le s \le h_j\}$ to index the entries in $\tilde{M}$. It follows by the construction
that

$\tilde{M}_{i,(j,s)} = i^{s-1} z_j^{-(i-1)}$,  when $z_j$ is real, $1 \le i \le h_0$;

$\tilde{M}_{i,(j+,s)} = i^{s-1} |z_j|^{-(i-1)} \cos((i-1)\theta_j)$,  $\tilde{M}_{i,(j-,s)} = i^{s-1} |z_j|^{-(i-1)} \sin((i-1)\theta_j)$,  when $z_{j\pm}$ are conjugates, $1 \le i \le h_0$.

Define an $h_0 \times h_0$ matrix $T$ by

$T_{i,(j,s)} = i^{s-1} z_j^{-(i-1)}$,  $1 \le i \le h_0$.

Let $V$ be the $h_0 \times h_0$ confluent Vandermonde matrix generated by $\{z_j^{-1} :
|z_j| > 1\}$:

$V_{i,(j,s)} = 0$ for $1 \le i \le s - 1$, and $V_{i,(j,s)} = (i-1)(i-2) \cdots (i-s+1)\, z_j^{-(i-s)}$ for $s \le i \le h_0$.

First, it is seen that each column of $T$ is a (complex) linear combination of
columns in $\tilde{M}$. Second, we argue that each column of $V$ is a linear
combination of columns in $T$. To see this, note that $V_{i,(j,s)}$ can be written in the
form $V_{i,(j,s)} = g_{s-1}(i)\, z_j^{-(i-s)}$, where $g_{s-1}(x) = (x-1)(x-2) \cdots (x-s+1)$ is
a polynomial of degree $s - 1$. Let $c_0, \ldots, c_{s-1}$ be the coefficients of this
polynomial. Then, for each $i \ge s$, $V_{i,(j,s)} = z_j^{-(i-s)} \sum_{l=0}^{s-1} c_l i^l = \sum_{l=1}^{s} a_l T_{i,(j,l)}$,
where $a_l = c_{l-1} z_j^{s-1}$. The argument follows. Finally, it is well known that
$\det(V) \ne 0$. Combining these, we see that $\det(\tilde{M}) \ne 0$. Therefore, $M$ is
non-singular.
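
The relation between $T$ and the confluent Vandermonde matrix $V$ can be checked exactly on a small case. The sketch below is an illustration under stated assumptions, not the paper's computation: the roots (a double root at 2 and a simple root at 3) are hypothetical, only real roots are covered, and exact rational arithmetic from Python's `fractions` module is used so that the determinant is not affected by rounding.

```python
from fractions import Fraction


def g_coeffs(s):
    """Coefficients c_0, ..., c_{s-1} of g_{s-1}(x) = (x-1)(x-2)...(x-s+1)."""
    coeffs = [Fraction(1)]
    for r in range(1, s):
        shifted = [Fraction(0)] + coeffs                   # x * g(x)
        scaled = [-r * c for c in coeffs] + [Fraction(0)]  # -r * g(x)
        coeffs = [a + b for a, b in zip(shifted, scaled)]
    return coeffs


def t_entry(i, z, s):
    """T_{i,(j,s)} = i^(s-1) * z^(-(i-1))."""
    return Fraction(i) ** (s - 1) * Fraction(z) ** (-(i - 1))


def v_entry(i, z, s):
    """V_{i,(j,s)} = (i-1)...(i-s+1) * z^(-(i-s)); the product is 0 when i < s."""
    falling = Fraction(1)
    for r in range(1, s):
        falling *= (i - r)
    return falling * Fraction(z) ** (-(i - s))


def combo_entry(i, z, s):
    """sum_l a_l T_{i,(j,l)} with a_l = c_{l-1} z^(s-1)."""
    c = g_coeffs(s)
    return sum(c[l - 1] * Fraction(z) ** (s - 1) * t_entry(i, z, l)
               for l in range(1, s + 1))


def det(matrix):
    """Exact determinant by Gaussian elimination with row swaps."""
    a = [row[:] for row in matrix]
    n, sign, result = len(a), 1, Fraction(1)
    for c in range(n):
        pivot_row = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot_row is None:
            return Fraction(0)
        if pivot_row != c:
            a[c], a[pivot_row] = a[pivot_row], a[c]
            sign = -sign
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for t in range(c, n):
                a[r][t] -= factor * a[c][t]
    return sign * result


roots = [(2, 2), (3, 1)]  # hypothetical (root, multiplicity) pairs with |z_j| > 1
index = [(z, s) for z, mult in roots for s in range(1, mult + 1)]
h0 = len(index)
T = [[t_entry(i, z, s) for (z, s) in index] for i in range(1, h0 + 1)]
V = [[v_entry(i, z, s) for (z, s) in index] for i in range(1, h0 + 1)]
columns_match = all(v_entry(i, z, s) == combo_entry(i, z, s)
                    for (z, s) in index for i in range(1, h0 + 1))
print(columns_match, det(T), det(V))
```

In this example every column of `V` is reproduced exactly from columns of `T`, and both determinants are non-zero.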

Next, we show $A_2$ is non-singular. Note that $\sum_{i=1}^{k} i^s = \frac{k^{s+1}}{s+1}(1 + o(1))$,
$\sum_{i=1}^{k} i^s \cos^2((i-1)\theta) = \frac{k^{s+1}}{2(s+1)}(1 + o(1))$ and $\sum_{i=1}^{k} i^s \sin^2((i-1)\theta) = \frac{k^{s+1}}{2(s+1)}(1 + o(1))$,
for $\theta \ne -\pi, 0, \pi$. Also, $\sum_{i=1}^{k} i^s \sin((i-1)\theta) = o(k^{s+1})$ for all $\theta$, and
$\sum_{i=1}^{k} i^s \cos((i-1)\theta) = o(k^{s+1})$ for $\theta \ne 0$. Using these arguments and basic
equalities in trigonometric functions, we have

$(R'R)_{(j,s),(j',s')} = o(1) + \begin{cases} \frac{\sqrt{(2s-1)(2s'-1)}}{s + s' - 1}, & j = j', \\ 0, & \text{elsewhere.} \end{cases}$

As a result, $A_2$ is a block-diagonal matrix, where each block corresponds to
one $z_j$ on the unit circle and is equal to the matrix $W(h_j)$, where $h_j$ is the
replication number of $z_j$ and $W(h)(s, s') = \sqrt{(2s-1)(2s'-1)} / (s + s' - 1)$,
for $1 \le s, s' \le h$. Since such $W(h)$'s are non-singular, $A_2$ is non-singular.
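
The non-singularity of $W(h)$ can be illustrated with exact arithmetic. The sketch below relies on an observation added here rather than stated in the text: $W(h) = D H D$, where $H$ is the $h \times h$ Hilbert matrix with entries $1/(s+s'-1)$ and $D = \mathrm{diag}(\sqrt{2s-1})$, so that $\det W(h) = \prod_{s}(2s-1) \cdot \det H$. The function names are hypothetical, and the last helper compares a finite-$k$ Gram entry with $W(2)(1,2)$ for a toy double root at $z = 1$.

```python
import math
from fractions import Fraction


def hilbert_det(h):
    """Exact determinant of the h x h Hilbert matrix H(s, s') = 1 / (s + s' - 1)."""
    a = [[Fraction(1, s + t - 1) for t in range(1, h + 1)] for s in range(1, h + 1)]
    det = Fraction(1)
    for c in range(h):
        pivot = a[c][c]  # non-zero in exact arithmetic since H is positive definite
        det *= pivot
        for r in range(c + 1, h):
            factor = a[r][c] / pivot
            for t in range(c, h):
                a[r][t] -= factor * a[c][t]
    return det


def w_det(h):
    """det W(h) via W(h) = D H D with D = diag(sqrt(2s - 1))."""
    return math.prod(2 * s - 1 for s in range(1, h + 1)) * hilbert_det(h)


def gram_offdiag(k):
    """Normalized inner product of (1, ..., 1) and (k, ..., 1): the (1,2) entry
    of R'R for a double root at z = 1 (toy example)."""
    mu1 = [1.0] * k
    mu2 = [float(i) for i in range(k, 0, -1)]
    inner = sum(a * b for a, b in zip(mu1, mu2))
    return inner / math.sqrt(sum(a * a for a in mu1) * sum(b * b for b in mu2))


for h in range(1, 6):
    print(h, w_det(h))
print(gram_offdiag(10000), math.sqrt(3) / 2)
```

The determinants are strictly positive for every order tried, and the finite-$k$ Gram entry is close to $\sqrt{3}/2 = W(2)(1,2)$.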

□